2025-12-04T09:42:13.4468123Z Current runner version: '2.330.0'
2025-12-04T09:42:13.4474076Z Runner name: 'i-07df7d64debf86ede'
2025-12-04T09:42:13.4474805Z Runner group name: 'default'
2025-12-04T09:42:13.4475682Z Machine name: 'ip-10-0-6-74'
2025-12-04T09:42:13.4478713Z ##[group]GITHUB_TOKEN Permissions
2025-12-04T09:42:13.4480846Z Contents: read
2025-12-04T09:42:13.4481528Z Metadata: read
2025-12-04T09:42:13.4481998Z ##[endgroup]
2025-12-04T09:42:13.4484200Z Secret source: Actions
2025-12-04T09:42:13.4484943Z Prepare workflow directory
2025-12-04T09:42:13.4959655Z Prepare all required actions
2025-12-04T09:42:13.4993546Z Getting action download info
2025-12-04T09:42:13.8238721Z Download action repository 'pytorch/test-infra@main' (SHA:39aa74d619174326f4e2fb0e216151c2f29d9ffd)
2025-12-04T09:42:16.4135527Z Download action repository 'pytorch/pytorch@main' (SHA:7716da9fb23f27a65b41f9f016a2afadf281c18f)
2025-12-04T09:42:33.2774860Z Download action repository 'actions/setup-python@a26af69be951a213d495a4c3e4e4022e16d87065' (SHA:a26af69be951a213d495a4c3e4e4022e16d87065)
2025-12-04T09:42:33.6635018Z Download action repository 'aws-actions/configure-aws-credentials@ececac1a45f3b08a01d2dd070d28d111c5fe6722' (SHA:ececac1a45f3b08a01d2dd070d28d111c5fe6722)
2025-12-04T09:42:33.9723477Z Download action repository 'aws-actions/amazon-ecr-login@062b18b96a7aff071d4dc91bc00c4c1a7945b076' (SHA:062b18b96a7aff071d4dc91bc00c4c1a7945b076)
2025-12-04T09:42:34.1659897Z Download action repository 'seemethere/download-artifact-s3@1da556a7aa0a088e3153970611f6c432d58e80e6' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6)
2025-12-04T09:42:34.4281363Z Download action repository 'seemethere/upload-artifact-s3@baba72d0712b404f646cebe0730933554ebce96a' (SHA:baba72d0712b404f646cebe0730933554ebce96a)
2025-12-04T09:42:34.7368977Z Getting action download info
2025-12-04T09:42:34.9295440Z Download action repository 'actions/checkout@v4' (SHA:34e114876b0b11c390a56381ad16ebd13914f8d5)
2025-12-04T09:42:35.2809188Z Getting action download info
2025-12-04T09:42:35.4374716Z Download action repository 'nick-fields/retry@v3.0.0' (SHA:7152eba30c6575329ac0576536151aca5a72780e)
2025-12-04T09:42:35.6715559Z Getting action download info
2025-12-04T09:42:35.8040573Z Download action repository 'nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482' (SHA:3e91a01664abd3c5cd539100d10d33b9c5b68482)
2025-12-04T09:42:36.0457570Z Getting action download info
2025-12-04T09:42:36.2198120Z Uses: pytorch/pytorch/.github/workflows/_linux-test.yml@refs/heads/main (ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32)
2025-12-04T09:42:36.2201663Z ##[group] Inputs
2025-12-04T09:42:36.2201995Z build-environment: linux-jammy-cuda12.8-py3.10-gcc11-debug
2025-12-04T09:42:36.2208380Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}]}
2025-12-04T09:42:36.2215120Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a
2025-12-04T09:42:36.2215721Z sync-tag:
2025-12-04T09:42:36.2216373Z timeout-minutes: 240
2025-12-04T09:42:36.2216573Z use-gha:
2025-12-04T09:42:36.2216726Z dashboard-tag:
2025-12-04T09:42:36.2216901Z s3-bucket: gha-artifacts
2025-12-04T09:42:36.2217098Z aws-role-to-assume:
2025-12-04T09:42:36.2217573Z disable-monitor: false
2025-12-04T09:42:36.2217836Z monitor-log-interval: 5
2025-12-04T09:42:36.2218060Z monitor-data-collect-interval: 1
2025-12-04T09:42:36.2218293Z ##[endgroup]
2025-12-04T09:42:36.2218841Z Complete job name: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled...
2025-12-04T09:42:36.2892722Z A job started hook has been configured by the self-hosted runner administrator
2025-12-04T09:42:36.2985836Z ##[group]Run '/home/ec2-user/runner-scripts/before_job.sh'
2025-12-04T09:42:36.2996266Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-12-04T09:42:36.2996807Z ##[endgroup]
2025-12-04T09:42:37.5727501Z Runner Type: linux.g6.4xlarge.experimental.nvidia.gpu
2025-12-04T09:42:37.5727953Z Instance Type: g6.4xlarge
2025-12-04T09:42:37.5728157Z AMI Name: unknown
2025-12-04T09:42:37.5765741Z AMI ID: ami-08982f1c5bf93d976
2025-12-04T09:42:42.5002159Z ##[group]Run pytorch/test-infra/.github/actions/setup-ssh@main
2025-12-04T09:42:42.5002520Z with:
2025-12-04T09:42:42.5003014Z github-secret: ***
2025-12-04T09:42:42.5003556Z instructions: All testing is done inside the container, to start an interactive session run: docker exec -it $(docker container ps --format '{{.ID}}') bash
2025-12-04T09:42:42.5004128Z activate-with-label: false
2025-12-04T09:42:42.5004352Z label: with-ssh
2025-12-04T09:42:42.5004527Z remove-existing-keys: true
2025-12-04T09:42:42.5004742Z fail-silently: true
2025-12-04T09:42:42.5004951Z env:
2025-12-04T09:42:42.5005096Z GIT_DEFAULT_BRANCH: main
2025-12-04T09:42:42.5005297Z ##[endgroup]
2025-12-04T09:42:42.6349076Z Please see https://github.com/pytorch/pytorch/wiki/Debugging-using-with-ssh-for-Github-Actions for more info.
2025-12-04T09:42:42.6350834Z Not on pull request and ciflow reference could not be extracted, skipping adding ssh keys
2025-12-04T09:42:42.6497164Z ##[group]Run pytorch/pytorch/.github/actions/checkout-pytorch@main
2025-12-04T09:42:42.6497517Z with:
2025-12-04T09:42:42.6497682Z no-sudo: true
2025-12-04T09:42:42.6497856Z submodules: recursive
2025-12-04T09:42:42.6498047Z fetch-depth: 0
2025-12-04T09:42:42.6498211Z env:
2025-12-04T09:42:42.6498368Z GIT_DEFAULT_BRANCH: main
2025-12-04T09:42:42.6498557Z ##[endgroup]
2025-12-04T09:42:42.6563381Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T09:42:42.6564439Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT"
2025-12-04T09:42:42.6576987Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-12-04T09:42:42.6577278Z env:
2025-12-04T09:42:42.6577462Z GIT_DEFAULT_BRANCH: main
2025-12-04T09:42:42.6577683Z ##[endgroup]
2025-12-04T09:42:42.6658306Z ##[group]Run # Use all available CPUs for fetching
2025-12-04T09:42:42.6658632Z # Use all available CPUs for fetching
2025-12-04T09:42:42.6658887Z cd "${GITHUB_WORKSPACE}"
2025-12-04T09:42:42.6659146Z git config --global fetch.parallel 0
2025-12-04T09:42:42.6659440Z git config --global submodule.fetchJobs 0
2025-12-04T09:42:42.6659685Z
2025-12-04T09:42:42.6659995Z # Clean workspace. The default checkout action should also do this, but
2025-12-04T09:42:42.6660371Z # do it here as well just in case
2025-12-04T09:42:42.6660602Z if [[ -d .git ]]; then
2025-12-04T09:42:42.6660817Z  if [ -z "${NO_SUDO}" ]; then
2025-12-04T09:42:42.6661038Z  sudo git clean -ffdx
2025-12-04T09:42:42.6661235Z  else
2025-12-04T09:42:42.6661404Z  git clean -ffdx
2025-12-04T09:42:42.6681438Z  fi
2025-12-04T09:42:42.6681662Z fi
2025-12-04T09:42:42.6689280Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
2025-12-04T09:42:42.6689575Z env:
2025-12-04T09:42:42.6689736Z GIT_DEFAULT_BRANCH: main
2025-12-04T09:42:42.6689927Z NO_SUDO: true
2025-12-04T09:42:42.6690093Z ##[endgroup]
2025-12-04T09:42:42.6806514Z ##[group]Run actions/checkout@v4
2025-12-04T09:42:42.6806740Z with:
2025-12-04T09:42:42.6806929Z ref: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32
2025-12-04T09:42:42.6807175Z fetch-depth: 0
2025-12-04T09:42:42.6807345Z submodules: recursive
2025-12-04T09:42:42.6807556Z show-progress: false
2025-12-04T09:42:42.6807754Z repository: pytorch/pytorch
2025-12-04T09:42:42.6808080Z token: ***
2025-12-04T09:42:42.6808241Z ssh-strict: true
2025-12-04T09:42:42.6808408Z ssh-user: git
2025-12-04T09:42:42.6808581Z persist-credentials: true
2025-12-04T09:42:42.6808768Z clean: true
2025-12-04T09:42:42.6808941Z sparse-checkout-cone-mode: true
2025-12-04T09:42:42.6809150Z fetch-tags: false
2025-12-04T09:42:42.6809317Z lfs: false
2025-12-04T09:42:42.6809478Z set-safe-directory: true
2025-12-04T09:42:42.6809670Z env:
2025-12-04T09:42:42.6809815Z GIT_DEFAULT_BRANCH: main
2025-12-04T09:42:42.6809996Z ##[endgroup]
2025-12-04T09:42:42.7836214Z Syncing repository: pytorch/pytorch
2025-12-04T09:42:42.7837423Z ##[group]Getting Git version info
2025-12-04T09:42:42.7837813Z Working directory is '/home/ec2-user/actions-runner/_work/pytorch/pytorch'
2025-12-04T09:42:42.7838307Z [command]/usr/bin/git version
2025-12-04T09:42:42.8036295Z git version 2.50.1
2025-12-04T09:42:42.8059711Z ##[endgroup]
2025-12-04T09:42:42.8069118Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/2224e379-5159-4c11-80aa-77cc183ca70a/.gitconfig'
2025-12-04T09:42:42.8089059Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/2224e379-5159-4c11-80aa-77cc183ca70a' before making global git config changes
2025-12-04T09:42:42.8089962Z Adding repository directory to the temporary git global config as a safe directory
2025-12-04T09:42:42.8093584Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch
2025-12-04T09:42:42.8140724Z Deleting the contents of '/home/ec2-user/actions-runner/_work/pytorch/pytorch'
2025-12-04T09:42:42.8144380Z ##[group]Initializing the repository
2025-12-04T09:42:42.8148664Z [command]/usr/bin/git init /home/ec2-user/actions-runner/_work/pytorch/pytorch
2025-12-04T09:42:42.8234581Z hint: Using 'master' as the name for the initial branch. This default branch name
2025-12-04T09:42:42.8236528Z hint: is subject to change. To configure the initial branch name to use in all
2025-12-04T09:42:42.8237668Z hint: of your new repositories, which will suppress this warning, call:
2025-12-04T09:42:42.8239041Z hint:
2025-12-04T09:42:42.8239644Z hint: git config --global init.defaultBranch <name>
2025-12-04T09:42:42.8240025Z hint:
2025-12-04T09:42:42.8240358Z hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
2025-12-04T09:42:42.8240862Z hint: 'development'. The just-created branch can be renamed via this command:
2025-12-04T09:42:42.8241234Z hint:
2025-12-04T09:42:42.8241424Z hint: git branch -m <name>
2025-12-04T09:42:42.8241652Z hint:
2025-12-04T09:42:42.8242002Z hint: Disable this message with "git config set advice.defaultBranchName false"
2025-12-04T09:42:42.8245369Z Initialized empty Git repository in /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/
2025-12-04T09:42:42.8256323Z [command]/usr/bin/git remote add origin https://github.com/pytorch/pytorch
2025-12-04T09:42:42.8300065Z ##[endgroup]
2025-12-04T09:42:42.8300472Z ##[group]Disabling automatic garbage collection
2025-12-04T09:42:42.8302732Z [command]/usr/bin/git config --local gc.auto 0
2025-12-04T09:42:42.8329824Z ##[endgroup]
2025-12-04T09:42:42.8330206Z ##[group]Setting up auth
2025-12-04T09:42:42.8335469Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2025-12-04T09:42:42.8364420Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2025-12-04T09:42:42.8751648Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2025-12-04T09:42:42.8780408Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2025-12-04T09:42:42.9147416Z [command]/usr/bin/git config --local --name-only --get-regexp ^includeIf\.gitdir:
2025-12-04T09:42:42.9177284Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url
2025-12-04T09:42:42.9520667Z [command]/usr/bin/git config --local http.https://github.com/.extraheader AUTHORIZATION: basic ***
2025-12-04T09:42:42.9575577Z ##[endgroup]
2025-12-04T09:42:42.9576003Z ##[group]Fetching
the repository 2025-12-04T09:42:42.9584290Z [command]/usr/bin/git -c protocol.version=2 fetch --prune --no-recurse-submodules origin +refs/heads/*:refs/remotes/origin/* +refs/tags/*:refs/tags/* 2025-12-04T09:43:32.0070072Z From https://github.com/pytorch/pytorch 2025-12-04T09:43:32.0070987Z * [new branch] 2.6.0.dev20241004+ -> origin/2.6.0.dev20241004+ 2025-12-04T09:43:32.0071639Z * [new branch] 2.9.1 -> origin/2.9.1 2025-12-04T09:43:32.0072317Z * [new branch] AaronWang04_addmmfusion_perftest -> origin/AaronWang04_addmmfusion_perftest 2025-12-04T09:43:32.0072888Z * [new branch] Flamefire-patch-1 -> origin/Flamefire-patch-1 2025-12-04T09:43:32.0074115Z * [new branch] HDCharles-2.6.0-release-notes -> origin/HDCharles-2.6.0-release-notes 2025-12-04T09:43:32.0075763Z * [new branch] HOPrintFunc -> origin/HOPrintFunc 2025-12-04T09:43:32.0079145Z * [new branch] IvanKobzarev/stack/1 -> origin/IvanKobzarev/stack/1 2025-12-04T09:43:32.0081874Z * [new branch] NicoshevSVE128 -> origin/NicoshevSVE128 2025-12-04T09:43:32.0083531Z * [new branch] PR-AOTInductorNoneBug -> origin/PR-AOTInductorNoneBug 2025-12-04T09:43:32.0085344Z * [new branch] PR-AOTInductorNoneBugFix -> origin/PR-AOTInductorNoneBugFix 2025-12-04T09:43:32.0087190Z * [new branch] PR-FixConfigsIssue -> origin/PR-FixConfigsIssue 2025-12-04T09:43:32.0088857Z * [new branch] PR-NoneBugFix-viable -> origin/PR-NoneBugFix-viable 2025-12-04T09:43:32.0090926Z * [new branch] PR-ResetToZero -> origin/PR-ResetToZero 2025-12-04T09:43:32.0092790Z * [new branch] Update-Flash-Packaging -> origin/Update-Flash-Packaging 2025-12-04T09:43:32.0094631Z * [new branch] VLA_exp -> origin/VLA_exp 2025-12-04T09:43:32.0096585Z * [new branch] activation_bench -> origin/activation_bench 2025-12-04T09:43:32.0099016Z * [new branch] addmm-heuristic -> origin/addmm-heuristic 2025-12-04T09:43:32.0101509Z * [new branch] adi/onednn_aarch64 -> origin/adi/onednn_aarch64 2025-12-04T09:43:32.0103295Z * [new branch] adi/test -> origin/adi/test 
2025-12-04T09:43:32.0105133Z * [new branch] adi/test_bgemm -> origin/adi/test_bgemm 2025-12-04T09:43:32.0106967Z * [new branch] adi/test_m8g -> origin/adi/test_m8g 2025-12-04T09:43:32.0108893Z * [new branch] adi/test_onednn -> origin/adi/test_onednn 2025-12-04T09:43:32.0110753Z * [new branch] adi/test_onednn_v3.9 -> origin/adi/test_onednn_v3.9 2025-12-04T09:43:32.0112469Z * [new branch] adi/test_presve_change -> origin/adi/test_presve_change 2025-12-04T09:43:32.0114174Z * [new branch] adi/test_timm -> origin/adi/test_timm 2025-12-04T09:43:32.0116287Z * [new branch] adi/testpresve_change -> origin/adi/testpresve_change 2025-12-04T09:43:32.0119427Z * [new branch] aditew01/test/vec_bf16 -> origin/aditew01/test/vec_bf16 2025-12-04T09:43:32.0121282Z * [new branch] ah-globalfeedback-hook -> origin/ah-globalfeedback-hook 2025-12-04T09:43:32.0123336Z * [new branch] albanD-patch-1 -> origin/albanD-patch-1 2025-12-04T09:43:32.0124970Z * [new branch] also-surround-shimh -> origin/also-surround-shimh 2025-12-04T09:43:32.0127614Z * [new branch] angelayi/aot_compile -> origin/angelayi/aot_compile 2025-12-04T09:43:32.0129515Z * [new branch] angelayi/aoti_additional_files -> origin/angelayi/aoti_additional_files 2025-12-04T09:43:32.0131078Z * [new branch] angelayi/benchmark -> origin/angelayi/benchmark 2025-12-04T09:43:32.0133028Z * [new branch] angelayi/change_pytree_serialization -> origin/angelayi/change_pytree_serialization 2025-12-04T09:43:32.0134553Z * [new branch] angelayi/cpp_loader -> origin/angelayi/cpp_loader 2025-12-04T09:43:32.0136482Z * [new branch] angelayi/inductor_const -> origin/angelayi/inductor_const 2025-12-04T09:43:32.0138220Z * [new branch] angelayi/lstm -> origin/angelayi/lstm 2025-12-04T09:43:32.0140439Z * [new branch] angelayi/no_so_weight -> origin/angelayi/no_so_weight 2025-12-04T09:43:32.0142703Z * [new branch] angelayi/scan_layers -> origin/angelayi/scan_layers 2025-12-04T09:43:32.0144514Z * [new branch] angelayi/side_eff -> origin/angelayi/side_eff 
2025-12-04T09:43:32.0146368Z * [new branch] angelayi/state_dict -> origin/angelayi/state_dict 2025-12-04T09:43:32.0148311Z * [new branch] angelayi/symint_input -> origin/angelayi/symint_input 2025-12-04T09:43:32.0150348Z * [new branch] angelayi/symm_mem -> origin/angelayi/symm_mem 2025-12-04T09:43:32.0152037Z * [new branch] angelayi/test_cpp -> origin/angelayi/test_cpp 2025-12-04T09:43:32.0153870Z * [new branch] angelayi/torch_size -> origin/angelayi/torch_size 2025-12-04T09:43:32.0155832Z * [new branch] annotate_assert -> origin/annotate_assert 2025-12-04T09:43:32.0157788Z * [new branch] annotate_fallback_kernel -> origin/annotate_fallback_kernel 2025-12-04T09:43:32.0159535Z * [new branch] annotation_deepcopy -> origin/annotation_deepcopy 2025-12-04T09:43:32.0161455Z * [new branch] annotation_dynamo -> origin/annotation_dynamo 2025-12-04T09:43:32.0163238Z * [new branch] aot_eager_stack_trace -> origin/aot_eager_stack_trace 2025-12-04T09:43:32.0165025Z * [new branch] aoti-cuda-alloc -> origin/aoti-cuda-alloc 2025-12-04T09:43:32.0166833Z * [new branch] aoti_const_device -> origin/aoti_const_device 2025-12-04T09:43:32.0168626Z * [new branch] aoti_fqn_name_interface -> origin/aoti_fqn_name_interface 2025-12-04T09:43:32.0170371Z * [new branch] aoti_package_weights_binary -> origin/aoti_package_weights_binary 2025-12-04T09:43:32.0172150Z * [new branch] aoti_target_windows -> origin/aoti_target_windows 2025-12-04T09:43:32.0175246Z * [new branch] arsh/feat/inductor_check_profiling -> origin/arsh/feat/inductor_check_profiling 2025-12-04T09:43:32.0176910Z * [new branch] async_tp -> origin/async_tp 2025-12-04T09:43:32.0178944Z * [new branch] atalman-inductor-perf-cu124 -> origin/atalman-inductor-perf-cu124 2025-12-04T09:43:32.0180797Z * [new branch] atalman-inductor-perf-cu124.1 -> origin/atalman-inductor-perf-cu124.1 2025-12-04T09:43:32.0182642Z * [new branch] atalman-patch-2 -> origin/atalman-patch-2 2025-12-04T09:43:32.0184578Z * [new branch] atalman-patch-3 -> 
origin/atalman-patch-3 2025-12-04T09:43:32.0186368Z * [new branch] atalman-patch-4 -> origin/atalman-patch-4 2025-12-04T09:43:32.0188347Z * [new branch] atalman-patch-5 -> origin/atalman-patch-5 2025-12-04T09:43:32.0190185Z * [new branch] atalman-patch-6 -> origin/atalman-patch-6 2025-12-04T09:43:32.0192075Z * [new branch] atalman-patch-7 -> origin/atalman-patch-7 2025-12-04T09:43:32.0193895Z * [new branch] atalman-patch-8 -> origin/atalman-patch-8 2025-12-04T09:43:32.0195739Z * [new branch] atalman_inductor_2.3.1 -> origin/atalman_inductor_2.3.1 2025-12-04T09:43:32.0197555Z * [new branch] atalman_inductor_2.4.0 -> origin/atalman_inductor_2.4.0 2025-12-04T09:43:32.0199358Z * [new branch] atalman_inductor_2.4.x -> origin/atalman_inductor_2.4.x 2025-12-04T09:43:32.0201325Z * [new branch] attention_benchmarking_clean -> origin/attention_benchmarking_clean 2025-12-04T09:43:32.0203635Z * [new branch] bahuang/dt_fix_scalar_add -> origin/bahuang/dt_fix_scalar_add 2025-12-04T09:43:32.0205389Z * [new branch] bahuang/fix_debug_mode -> origin/bahuang/fix_debug_mode 2025-12-04T09:43:32.0207149Z * [new branch] bahuang/fix_expand -> origin/bahuang/fix_expand 2025-12-04T09:43:32.0208958Z * [new branch] bahuang/test -> origin/bahuang/test 2025-12-04T09:43:32.0211523Z * [new branch] base/1.5 -> origin/base/1.5 2025-12-04T09:43:32.0213579Z * [new branch] batching_sdpa_efficient_attention -> origin/batching_sdpa_efficient_attention 2025-12-04T09:43:32.0215395Z * [new branch] bench_scaled_mm_ops -> origin/bench_scaled_mm_ops 2025-12-04T09:43:32.0217408Z * [new branch] benchmark-updates -> origin/benchmark-updates 2025-12-04T09:43:32.0219014Z * [new branch] benchmarking-script -> origin/benchmarking-script 2025-12-04T09:43:32.0221467Z * [new branch] bertmaher/pinbump26 -> origin/bertmaher/pinbump26 2025-12-04T09:43:32.0223861Z * [new branch] bertrand/cutlass -> origin/bertrand/cutlass 2025-12-04T09:43:32.0226336Z * [new branch] bf/bug-static-input -> origin/bf/bug-static-input 
2025-12-04T09:43:32.0227981Z * [new branch] bf/cg-backend -> origin/bf/cg-backend 2025-12-04T09:43:32.0229890Z * [new branch] bf/cg-nccl-test -> origin/bf/cg-nccl-test 2025-12-04T09:43:32.0231573Z * [new branch] bf/cg-remove-check -> origin/bf/cg-remove-check 2025-12-04T09:43:32.0233389Z * [new branch] bf/clean-torchbench-hf -> origin/bf/clean-torchbench-hf 2025-12-04T09:43:32.0235009Z * [new branch] bf/combo-debug-log -> origin/bf/combo-debug-log 2025-12-04T09:43:32.0236787Z * [new branch] bf/cudagraph -> origin/bf/cudagraph 2025-12-04T09:43:32.0239074Z * [new branch] bf/cudagraph-disable-input-mutation -> origin/bf/cudagraph-disable-input-mutation 2025-12-04T09:43:32.0241115Z * [new branch] bf/cudagraph-enable-input-mutation-support-benchmark -> origin/bf/cudagraph-enable-input-mutation-support-benchmark 2025-12-04T09:43:32.0242594Z * [new branch] bf/cudagraph-partition -> origin/bf/cudagraph-partition 2025-12-04T09:43:32.0244587Z * [new branch] bf/donated-buffer-bench -> origin/bf/donated-buffer-bench 2025-12-04T09:43:32.0246494Z * [new branch] bf/dynamo-partition -> origin/bf/dynamo-partition 2025-12-04T09:43:32.0248243Z * [new branch] bf/lite -> origin/bf/lite 2025-12-04T09:43:32.0250055Z * [new branch] bf/pa-non-divisible -> origin/bf/pa-non-divisible 2025-12-04T09:43:32.0251991Z * [new branch] bf/partition-cache-free-symbols -> origin/bf/partition-cache-free-symbols 2025-12-04T09:43:32.0253803Z * [new branch] bf/partition-memory-plan -> origin/bf/partition-memory-plan 2025-12-04T09:43:32.0255864Z * [new branch] bf/partition-move-cpu -> origin/bf/partition-move-cpu 2025-12-04T09:43:32.0257883Z * [new branch] bf/partition-view-fallback -> origin/bf/partition-view-fallback 2025-12-04T09:43:32.0259629Z * [new branch] bf/remove-check-55b0c39d -> origin/bf/remove-check-55b0c39d 2025-12-04T09:43:32.0261469Z * [new branch] bf/timm-nov-26-2025 -> origin/bf/timm-nov-26-2025 2025-12-04T09:43:32.0263242Z * [new branch] bf/transformer-pin-4-57-3 -> 
origin/bf/transformer-pin-4-57-3 2025-12-04T09:43:32.0265125Z * [new branch] bisect_perf_hf_T5_3acc6eac492 -> origin/bisect_perf_hf_T5_3acc6eac492 2025-12-04T09:43:32.0266749Z * [new branch] bisect_perf_hf_T5_3fcf66f61fb -> origin/bisect_perf_hf_T5_3fcf66f61fb 2025-12-04T09:43:32.0268712Z * [new branch] bisect_perf_hf_T5_4009d154129 -> origin/bisect_perf_hf_T5_4009d154129 2025-12-04T09:43:32.0270359Z * [new branch] bisect_perf_hf_T5_40d0740e73d -> origin/bisect_perf_hf_T5_40d0740e73d 2025-12-04T09:43:32.0272210Z * [new branch] bisect_perf_hf_T5_5268754e -> origin/bisect_perf_hf_T5_5268754e 2025-12-04T09:43:32.0274012Z * [new branch] bisect_perf_hf_T5_7d89a8d385c -> origin/bisect_perf_hf_T5_7d89a8d385c 2025-12-04T09:43:32.0275698Z * [new branch] bisect_perf_hf_T5_b7a25c1ee7c -> origin/bisect_perf_hf_T5_b7a25c1ee7c 2025-12-04T09:43:32.0277458Z * [new branch] bisect_perf_hf_T5_c25b201583f -> origin/bisect_perf_hf_T5_c25b201583f 2025-12-04T09:43:32.0279238Z * [new branch] bisect_perf_hf_T5_c93e57efac0 -> origin/bisect_perf_hf_T5_c93e57efac0 2025-12-04T09:43:32.0281332Z * [new branch] bisect_perf_hf_T5_ca9813ea149 -> origin/bisect_perf_hf_T5_ca9813ea149 2025-12-04T09:43:32.0282743Z * [new branch] bisect_perf_hf_T5_d65f194a -> origin/bisect_perf_hf_T5_d65f194a 2025-12-04T09:43:32.0284624Z * [new branch] bisect_perf_hf_T5_da94ab0b -> origin/bisect_perf_hf_T5_da94ab0b 2025-12-04T09:43:32.0286427Z * [new branch] bisect_perf_hf_T5_da94ab0b_new -> origin/bisect_perf_hf_T5_da94ab0b_new 2025-12-04T09:43:32.0288019Z * [new branch] bisect_perf_hf_T5_db4e8a1d8a8 -> origin/bisect_perf_hf_T5_db4e8a1d8a8 2025-12-04T09:43:32.0289848Z * [new branch] bisect_perf_hf_T5_e0d97e936a2 -> origin/bisect_perf_hf_T5_e0d97e936a2 2025-12-04T09:43:32.0291610Z * [new branch] bisect_perf_hf_T5_f23621ec563 -> origin/bisect_perf_hf_T5_f23621ec563 2025-12-04T09:43:32.0294331Z * [new branch] brister/fx_device_type -> origin/brister/fx_device_type 2025-12-04T09:43:32.0296128Z * [new branch] 
brister/test_inductor_all_fx -> origin/brister/test_inductor_all_fx 2025-12-04T09:43:32.0297844Z * [new branch] brister/tiled_reduction_no_numel_check -> origin/brister/tiled_reduction_no_numel_check 2025-12-04T09:43:32.0299707Z * [new branch] bwd-backup -> origin/bwd-backup 2025-12-04T09:43:32.0301579Z * [new branch] c57382a49 -> origin/c57382a49 2025-12-04T09:43:32.0303378Z * [new branch] ca_0431d47eaa -> origin/ca_0431d47eaa 2025-12-04T09:43:32.0305056Z * [new branch] ca_fix_0431d47eaa -> origin/ca_fix_0431d47eaa 2025-12-04T09:43:32.0307688Z * [new branch] camyllh/test_setup_hooks_push -> origin/camyllh/test_setup_hooks_push 2025-12-04T09:43:32.0309669Z * [new branch] cccclai-patch-1 -> origin/cccclai-patch-1 2025-12-04T09:43:32.0311554Z * [new branch] cherry-pick-159969-by-pytorch_bot_bot_ -> origin/cherry-pick-159969-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0313282Z * [new branch] cherry-pick-160586-by-pytorch_bot_bot_ -> origin/cherry-pick-160586-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0315199Z * [new branch] cherry-pick-162208-by-pytorch_bot_bot_ -> origin/cherry-pick-162208-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0316930Z * [new branch] cherry-pick-163169-by-pytorch_bot_bot_ -> origin/cherry-pick-163169-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0318881Z * [new branch] cherry-pick-165086-by-pytorch_bot_bot_ -> origin/cherry-pick-165086-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0320848Z * [new branch] cherry-pick-165514-by-pytorch_bot_bot_ -> origin/cherry-pick-165514-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0322832Z * [new branch] cherry-pick-165601-by-pytorch_bot_bot_ -> origin/cherry-pick-165601-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0324363Z * [new branch] cherry-pick-165667-by-pytorch_bot_bot_ -> origin/cherry-pick-165667-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0326338Z * [new branch] cherry-pick-165815-by-pytorch_bot_bot_ -> origin/cherry-pick-165815-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0328031Z * [new branch] cherry-pick-165922-by-pytorch_bot_bot_ -> 
origin/cherry-pick-165922-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0330110Z * [new branch] cherry-pick-166148-by-pytorch_bot_bot_ -> origin/cherry-pick-166148-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0331778Z * [new branch] cherry-pick-166181-by-pytorch_bot_bot_ -> origin/cherry-pick-166181-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0333551Z * [new branch] cherry-pick-166404-by-pytorch_bot_bot_ -> origin/cherry-pick-166404-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0335431Z * [new branch] cherry-pick-166427-by-pytorch_bot_bot_ -> origin/cherry-pick-166427-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0337380Z * [new branch] cherry-pick-166480-by-pytorch_bot_bot_ -> origin/cherry-pick-166480-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0339122Z * [new branch] cherry-pick-166570-by-pytorch_bot_bot_ -> origin/cherry-pick-166570-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0340865Z * [new branch] cherry-pick-166993-by-pytorch_bot_bot_ -> origin/cherry-pick-166993-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0342696Z * [new branch] cherry-pick-167111-by-pytorch_bot_bot_ -> origin/cherry-pick-167111-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0344610Z * [new branch] cherry-pick-167478-by-pytorch_bot_bot_ -> origin/cherry-pick-167478-by-pytorch_bot_bot_ 2025-12-04T09:43:32.0346359Z * [new branch] cherry_pick_166036_166040 -> origin/cherry_pick_166036_166040 2025-12-04T09:43:32.0348318Z * [new branch] cherry_pick_166457 -> origin/cherry_pick_166457 2025-12-04T09:43:32.0350112Z * [new branch] cherrypick_166338 -> origin/cherrypick_166338 2025-12-04T09:43:32.0351960Z * [new branch] cherrypick_166458 -> origin/cherrypick_166458 2025-12-04T09:43:32.0353654Z * [new branch] cherrypick_166586 -> origin/cherrypick_166586 2025-12-04T09:43:32.0355590Z * [new branch] cherrypick_166956 -> origin/cherrypick_166956 2025-12-04T09:43:32.0357598Z * [new branch] ci_attn -> origin/ci_attn 2025-12-04T09:43:32.0359327Z * [new branch] codex-testing -> origin/codex-testing 2025-12-04T09:43:32.0362065Z * [new branch] 
codex/add-check_memory_overlap-helper-functions -> origin/codex/add-check_memory_overlap-helper-functions 2025-12-04T09:43:32.0363661Z * [new branch] codex/fix-issue-121219-in-pytorch -> origin/codex/fix-issue-121219-in-pytorch 2025-12-04T09:43:32.0366059Z * [new branch] codex/investigate-segfaults-in-get_tensor_storage_id -> origin/codex/investigate-segfaults-in-get_tensor_storage_id 2025-12-04T09:43:32.0368168Z * [new branch] codex/refactor-lintrunner-config-to-use-uv-run -> origin/codex/refactor-lintrunner-config-to-use-uv-run 2025-12-04T09:43:32.0369721Z * [new branch] compatiblpy39util -> origin/compatiblpy39util 2025-12-04T09:43:32.0371594Z * [new branch] cond_hop_device -> origin/cond_hop_device 2025-12-04T09:43:32.0373358Z * [new branch] context_test -> origin/context_test 2025-12-04T09:43:32.0375945Z * [new branch] copilot/code-style-cleanup-python-pip -> origin/copilot/code-style-cleanup-python-pip 2025-12-04T09:43:32.0378308Z * [new branch] cpio/fix_new_ami_tests -> origin/cpio/fix_new_ami_tests 2025-12-04T09:43:32.0380190Z * [new branch] cpp-docs-dependency-upgrade -> origin/cpp-docs-dependency-upgrade 2025-12-04T09:43:32.0382647Z * [new branch] crpa/typo-in-inductor_comm_lowering -> origin/crpa/typo-in-inductor_comm_lowering 2025-12-04T09:43:32.0385027Z * [new branch] csl/always_produce_xml -> origin/csl/always_produce_xml 2025-12-04T09:43:32.0386685Z * [new branch] csl/build_test_more_procs -> origin/csl/build_test_more_procs 2025-12-04T09:43:32.0388574Z * [new branch] csl/build_test_more_procs2 -> origin/csl/build_test_more_procs2 2025-12-04T09:43:32.0390320Z * [new branch] csl/clean_up -> origin/csl/clean_up 2025-12-04T09:43:32.0392116Z * [new branch] csl/fix_retry_segfault_exit -> origin/csl/fix_retry_segfault_exit 2025-12-04T09:43:32.0393783Z * [new branch] csl/katex -> origin/csl/katex 2025-12-04T09:43:32.0395757Z * [new branch] csl/larger_runner -> origin/csl/larger_runner 2025-12-04T09:43:32.0397864Z * [new branch] csl/lint_testing -> 
origin/csl/lint_testing 2025-12-04T09:43:32.0399978Z * [new branch] csl/lint_thing -> origin/csl/lint_thing 2025-12-04T09:43:32.0402018Z * [new branch] csl/lintrunner_stuff -> origin/csl/lintrunner_stuff 2025-12-04T09:43:32.0403762Z * [new branch] csl/manually_gen_json -> origin/csl/manually_gen_json 2025-12-04T09:43:32.0405532Z * [new branch] csl/mps_sharding -> origin/csl/mps_sharding 2025-12-04T09:43:32.0407318Z * [new branch] csl/multistage_docker -> origin/csl/multistage_docker 2025-12-04T09:43:32.0409186Z * [new branch] csl/print_timing -> origin/csl/print_timing 2025-12-04T09:43:32.0411012Z * [new branch] csl/remove_experiment -> origin/csl/remove_experiment 2025-12-04T09:43:32.0412828Z * [new branch] csl/remove_maybe_unused_var -> origin/csl/remove_maybe_unused_var 2025-12-04T09:43:32.0414785Z * [new branch] csl/remove_repo_specific_autolabel -> origin/csl/remove_repo_specific_autolabel 2025-12-04T09:43:32.0416634Z * [new branch] csl/remove_run_parallel -> origin/csl/remove_run_parallel 2025-12-04T09:43:32.0418249Z * [new branch] csl/remove_unused_vars -> origin/csl/remove_unused_vars 2025-12-04T09:43:32.0419988Z * [new branch] csl/revert_open -> origin/csl/revert_open 2025-12-04T09:43:32.0421838Z * [new branch] csl/skip_build -> origin/csl/skip_build 2025-12-04T09:43:32.0423650Z * [new branch] csl/smaller_avx_amx_runenrs -> origin/csl/smaller_avx_amx_runenrs 2025-12-04T09:43:32.0425379Z * [new branch] csl/td_job_level -> origin/csl/td_job_level 2025-12-04T09:43:32.0427263Z * [new branch] csl/test_cuda_build_large_runner -> origin/csl/test_cuda_build_large_runner 2025-12-04T09:43:32.0429247Z * [new branch] csl/test_owners_autograd_dispatch_nn -> origin/csl/test_owners_autograd_dispatch_nn 2025-12-04T09:43:32.0430943Z * [new branch] csl/test_owners_higher_confidence -> origin/csl/test_owners_higher_confidence 2025-12-04T09:43:32.0432650Z * [new branch] csl/upload_json_running -> origin/csl/upload_json_running 2025-12-04T09:43:32.0434477Z * [new branch] 
csl/win_sccache -> origin/csl/win_sccache 2025-12-04T09:43:32.0436206Z * [new branch] csl/xml_stuff -> origin/csl/xml_stuff 2025-12-04T09:43:32.0438057Z * [new branch] cublasrelax2 -> origin/cublasrelax2 2025-12-04T09:43:32.0439868Z * [new branch] cuda_mempool -> origin/cuda_mempool 2025-12-04T09:43:32.0441651Z * [new branch] custom_lowering_dict -> origin/custom_lowering_dict 2025-12-04T09:43:32.0444088Z * [new branch] d4l3k/debug_plane_frtrace -> origin/d4l3k/debug_plane_frtrace 2025-12-04T09:43:32.0446424Z * [new branch] daxia6/2.8o3 -> origin/daxia6/2.8o3 2025-12-04T09:43:32.0448258Z * [new branch] debug-guard -> origin/debug-guard 2025-12-04T09:43:32.0450158Z * [new branch] delete-quant-docs -> origin/delete-quant-docs 2025-12-04T09:43:32.0455482Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.0 2025-12-04T09:43:32.0457757Z * [new branch] dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 -> origin/dependabot/pip/dot-ci/docker/ci_commit_pins/main/transformers-4.57.1 2025-12-04T09:43:32.0459898Z * [new branch] desertfire/test_cpp_wrapper -> origin/desertfire/test_cpp_wrapper 2025-12-04T09:43:32.0461717Z * [new branch] desertfire/triton-cpu-for-aarch64 -> origin/desertfire/triton-cpu-for-aarch64 2025-12-04T09:43:32.0464471Z * [new branch] dev/dhruva/flex_attn_opt -> origin/dev/dhruva/flex_attn_opt 2025-12-04T09:43:32.0467156Z * [new branch] dev/joona/MPSNDArrayAdd -> origin/dev/joona/MPSNDArrayAdd 2025-12-04T09:43:32.0469419Z * [new branch] dev/joona/Unranked -> origin/dev/joona/Unranked 2025-12-04T09:43:32.0471259Z * [new branch] dev/joona/cat -> origin/dev/joona/cat 2025-12-04T09:43:32.0473004Z * [new branch] dev/joona/embeddingbag -> origin/dev/joona/embeddingbag 2025-12-04T09:43:32.0474820Z * [new branch] dev/joona/fix_sdpa_memtest -> origin/dev/joona/fix_sdpa_memtest 2025-12-04T09:43:32.0476849Z * [new branch] 
dev/joona/getTensorsString -> origin/dev/joona/getTensorsString 2025-12-04T09:43:32.0478775Z * [new branch] dev/joona/mps_linear_macos14 -> origin/dev/joona/mps_linear_macos14 2025-12-04T09:43:32.0481031Z * [new branch] dev/joona/scalar_clamp -> origin/dev/joona/scalar_clamp 2025-12-04T09:43:32.0483263Z * [new branch] dev/joona/sdpa -> origin/dev/joona/sdpa 2025-12-04T09:43:32.0485780Z * [new branch] dev/joona/sdpa_api -> origin/dev/joona/sdpa_api 2025-12-04T09:43:32.0487677Z * [new branch] dev/joona/type_inf -> origin/dev/joona/type_inf 2025-12-04T09:43:32.0489745Z * [new branch] dev/joona/ulpAssertClose -> origin/dev/joona/ulpAssertClose 2025-12-04T09:43:32.0491626Z * [new branch] dev/joona/upsize3d -> origin/dev/joona/upsize3d 2025-12-04T09:43:32.0493402Z * [new branch] disp_counter -> origin/disp_counter 2025-12-04T09:43:32.0495229Z * [new branch] divyanshk-patch-1 -> origin/divyanshk-patch-1 2025-12-04T09:43:32.0497158Z * [new branch] docs -> origin/docs 2025-12-04T09:43:32.0499047Z * [new branch] documentation -> origin/documentation 2025-12-04T09:43:32.0500849Z * [new branch] eager_model_benchmarks -> origin/eager_model_benchmarks 2025-12-04T09:43:32.0503325Z * [new branch] embg/test_inductor_ci_control -> origin/embg/test_inductor_ci_control 2025-12-04T09:43:32.0505027Z * [new branch] embg/triton_l2_prefetch_128B -> origin/embg/triton_l2_prefetch_128B 2025-12-04T09:43:32.0506651Z * [new branch] embg/triton_l2_prefetch_256B -> origin/embg/triton_l2_prefetch_256B 2025-12-04T09:43:32.0508582Z * [new branch] eqy-patch-1 -> origin/eqy-patch-1 2025-12-04T09:43:32.0510420Z * [new branch] eqy-patch-2 -> origin/eqy-patch-2 2025-12-04T09:43:32.0512243Z * [new branch] eqy-patch-3 -> origin/eqy-patch-3 2025-12-04T09:43:32.0514128Z * [new branch] eqy-patch-4 -> origin/eqy-patch-4 2025-12-04T09:43:32.0515947Z * [new branch] eqy-patch-5 -> origin/eqy-patch-5 2025-12-04T09:43:32.0517753Z * [new branch] eqy-patch-6 -> origin/eqy-patch-6 2025-12-04T09:43:32.0520205Z * [new 
branch] exclamaforte/amd-ma -> origin/exclamaforte/amd-ma 2025-12-04T09:43:32.0522057Z * [new branch] exclamaforte/combo-kernels-perf-run -> origin/exclamaforte/combo-kernels-perf-run 2025-12-04T09:43:32.0523662Z * [new branch] exclamaforte/do_bench_refactor -> origin/exclamaforte/do_bench_refactor 2025-12-04T09:43:32.0525412Z * [new branch] exclamaforte/enable-mem-dep-fusion -> origin/exclamaforte/enable-mem-dep-fusion 2025-12-04T09:43:32.0527265Z * [new branch] exclamaforte/fix-exhaustive-autotuning -> origin/exclamaforte/fix-exhaustive-autotuning 2025-12-04T09:43:32.0529313Z * [new branch] exclamaforte/fix-trace-parsing-fx-svg -> origin/exclamaforte/fix-trace-parsing-fx-svg 2025-12-04T09:43:32.0531558Z * [new branch] exclamaforte/force-pointwise-cat-perf-run -> origin/exclamaforte/force-pointwise-cat-perf-run 2025-12-04T09:43:32.0533257Z * [new branch] exclamaforte/fusion-data -> origin/exclamaforte/fusion-data 2025-12-04T09:43:32.0535307Z * [new branch] exclamaforte/gemm-benchmark-run -> origin/exclamaforte/gemm-benchmark-run 2025-12-04T09:43:32.0537011Z * [new branch] exclamaforte/gemm-export-model -> origin/exclamaforte/gemm-export-model 2025-12-04T09:43:32.0538797Z * [new branch] exclamaforte/gemm-model -> origin/exclamaforte/gemm-model 2025-12-04T09:43:32.0540815Z * [new branch] exclamaforte/gemm-model-all-data-collection -> origin/exclamaforte/gemm-model-all-data-collection 2025-12-04T09:43:32.0542371Z * [new branch] exclamaforte/gemm-to-amd -> origin/exclamaforte/gemm-to-amd 2025-12-04T09:43:32.0544292Z * [new branch] exclamaforte/just-gemm-model -> origin/exclamaforte/just-gemm-model 2025-12-04T09:43:32.0546219Z * [new branch] exclamaforte/just-gemm-model-no-refactor -> origin/exclamaforte/just-gemm-model-no-refactor 2025-12-04T09:43:32.0548122Z * [new branch] exclamaforte/profile-diff-algo -> origin/exclamaforte/profile-diff-algo 2025-12-04T09:43:32.0549981Z * [new branch] exclamaforte/profiler-visualization -> origin/exclamaforte/profiler-visualization 
2025-12-04T09:43:32.0551810Z * [new branch] exclamaforte/test_cpp_wrapper_mode -> origin/exclamaforte/test_cpp_wrapper_mode 2025-12-04T09:43:32.0553736Z * [new branch] exclamaforte/update-autotune-configs -> origin/exclamaforte/update-autotune-configs 2025-12-04T09:43:32.0555718Z * [new branch] exclamaforte/update-autotune-configs-2 -> origin/exclamaforte/update-autotune-configs-2 2025-12-04T09:43:32.0557703Z * [new branch] exec -> origin/exec 2025-12-04T09:43:32.0559650Z * [new branch] experimental-mosaic -> origin/experimental-mosaic 2025-12-04T09:43:32.0561518Z * [new branch] export-D61047529 -> origin/export-D61047529 2025-12-04T09:43:32.0563301Z * [new branch] export-D71412006 -> origin/export-D71412006 2025-12-04T09:43:32.0565227Z * [new branch] export-D73042989 -> origin/export-D73042989 2025-12-04T09:43:32.0566988Z * [new branch] export-D78957093 -> origin/export-D78957093 2025-12-04T09:43:32.0568756Z * [new branch] export-D78996107 -> origin/export-D78996107 2025-12-04T09:43:32.0570626Z * [new branch] export-D80823877 -> origin/export-D80823877 2025-12-04T09:43:32.0572585Z * [new branch] export-D80958642 -> origin/export-D80958642 2025-12-04T09:43:32.0574345Z * [new branch] export-D81054193 -> origin/export-D81054193 2025-12-04T09:43:32.0576062Z * [new branch] export-D81204584 -> origin/export-D81204584 2025-12-04T09:43:32.0577835Z * [new branch] export-D81429090 -> origin/export-D81429090 2025-12-04T09:43:32.0579797Z * [new branch] export-D82250826 -> origin/export-D82250826 2025-12-04T09:43:32.0581603Z * [new branch] export-D82253817 -> origin/export-D82253817 2025-12-04T09:43:32.0583408Z * [new branch] export-D83541846 -> origin/export-D83541846 2025-12-04T09:43:32.0585210Z * [new branch] export-D83627170 -> origin/export-D83627170 2025-12-04T09:43:32.0586995Z * [new branch] export-D83766701 -> origin/export-D83766701 2025-12-04T09:43:32.0588924Z * [new branch] export-D83768878 -> origin/export-D83768878 2025-12-04T09:43:32.0590664Z * [new branch] 
export-D83769447 -> origin/export-D83769447 2025-12-04T09:43:32.0592514Z * [new branch] export-D84089824 -> origin/export-D84089824 2025-12-04T09:43:32.0594255Z * [new branch] export-D84213020 -> origin/export-D84213020 2025-12-04T09:43:32.0596475Z * [new branch] export-D84373821 -> origin/export-D84373821 2025-12-04T09:43:32.0598387Z * [new branch] export-D84612194 -> origin/export-D84612194 2025-12-04T09:43:32.0600095Z * [new branch] export-D84890985 -> origin/export-D84890985 2025-12-04T09:43:32.0601852Z * [new branch] export-D85122326 -> origin/export-D85122326 2025-12-04T09:43:32.0603714Z * [new branch] export-D86256198 -> origin/export-D86256198 2025-12-04T09:43:32.0606039Z * [new branch] export-D86460608 -> origin/export-D86460608 2025-12-04T09:43:32.0607969Z * [new branch] export-D86474796 -> origin/export-D86474796 2025-12-04T09:43:32.0609909Z * [new branch] export-D86712396 -> origin/export-D86712396 2025-12-04T09:43:32.0611653Z * [new branch] export-D87022129 -> origin/export-D87022129 2025-12-04T09:43:32.0613472Z * [new branch] export-D87838959 -> origin/export-D87838959 2025-12-04T09:43:32.0615375Z * [new branch] export-D88319437 -> origin/export-D88319437 2025-12-04T09:43:32.0617338Z * [new branch] exported-model-train-idempotent -> origin/exported-model-train-idempotent 2025-12-04T09:43:32.0619111Z * [new branch] ezyang-titan-october -> origin/ezyang-titan-october 2025-12-04T09:43:32.0620871Z * [new branch] ezyang-titan-october2 -> origin/ezyang-titan-october2 2025-12-04T09:43:32.0622583Z * [new branch] ezyang-war -> origin/ezyang-war 2025-12-04T09:43:32.0625015Z * [new branch] ezyang/wip-aot-descriptors -> origin/ezyang/wip-aot-descriptors 2025-12-04T09:43:32.0626753Z * [new branch] fa_u8_brgemm -> origin/fa_u8_brgemm 2025-12-04T09:43:32.0629479Z * [new branch] fadeputr/sequence_fbgemm -> origin/fadeputr/sequence_fbgemm 2025-12-04T09:43:32.0631294Z * [new branch] fastmath_baseline -> origin/fastmath_baseline 2025-12-04T09:43:32.0633796Z * [new 
branch] fbcode/warm -> origin/fbcode/warm 2025-12-04T09:43:32.0635639Z * [new branch] fca -> origin/fca 2025-12-04T09:43:32.0637404Z * [new branch] fca2_ca5984c -> origin/fca2_ca5984c 2025-12-04T09:43:32.0639175Z * [new branch] fca5 -> origin/fca5 2025-12-04T09:43:32.0641532Z * [new branch] feature/justknobs-cpp -> origin/feature/justknobs-cpp 2025-12-04T09:43:32.0643316Z * [new branch] feature/numa-forkserver -> origin/feature/numa-forkserver 2025-12-04T09:43:32.0645394Z * [new branch] ffast_math_baseline -> origin/ffast_math_baseline 2025-12-04T09:43:32.0647128Z * [new branch] ffast_math_target -> origin/ffast_math_target 2025-12-04T09:43:32.0649598Z * [new branch] findhao/base_commit -> origin/findhao/base_commit 2025-12-04T09:43:32.0651338Z * [new branch] findhao/base_commit1 -> origin/findhao/base_commit1 2025-12-04T09:43:32.0653024Z * [new branch] findhao/multistream2 -> origin/findhao/multistream2 2025-12-04T09:43:32.0654714Z * [new branch] findhao/multistream5 -> origin/findhao/multistream5 2025-12-04T09:43:32.0657576Z * [new branch] findhao/multistream6 -> origin/findhao/multistream6 2025-12-04T09:43:32.0659283Z * [new branch] findhao/operatorbench3 -> origin/findhao/operatorbench3 2025-12-04T09:43:32.0660966Z * [new branch] findhao/operatorbench5 -> origin/findhao/operatorbench5 2025-12-04T09:43:32.0662636Z * [new branch] findhao/tritonparse -> origin/findhao/tritonparse 2025-12-04T09:43:32.0664518Z * [new branch] fix-ck-gemm-template-format -> origin/fix-ck-gemm-template-format 2025-12-04T09:43:32.0666514Z * [new branch] fix-config-ignore -> origin/fix-config-ignore 2025-12-04T09:43:32.0668312Z * [new branch] fix-dict-guard -> origin/fix-dict-guard 2025-12-04T09:43:32.0670120Z * [new branch] fix_addmm_issue -> origin/fix_addmm_issue 2025-12-04T09:43:32.0671931Z * [new branch] fix_amd_missing_cluster_dims -> origin/fix_amd_missing_cluster_dims 2025-12-04T09:43:32.0673686Z * [new branch] fix_bench_bwd_pass -> origin/fix_bench_bwd_pass 
2025-12-04T09:43:32.0675509Z * [new branch] fix_mem_profiler_config -> origin/fix_mem_profiler_config 2025-12-04T09:43:32.0677207Z * [new branch] fix_nvrtc_discovery -> origin/fix_nvrtc_discovery 2025-12-04T09:43:32.0678951Z * [new branch] fix_op_runner -> origin/fix_op_runner 2025-12-04T09:43:32.0680768Z * [new branch] fix_ubn_159469 -> origin/fix_ubn_159469 2025-12-04T09:43:32.0682646Z * [new branch] fixes-triage -> origin/fixes-triage 2025-12-04T09:43:32.0684428Z * [new branch] fixflashinfer -> origin/fixflashinfer 2025-12-04T09:43:32.0686177Z * [new branch] flash_decoding_cpu -> origin/flash_decoding_cpu 2025-12-04T09:43:32.0687956Z * [new branch] flex-flash -> origin/flex-flash 2025-12-04T09:43:32.0689753Z * [new branch] flex_attention_functorch_grad -> origin/flex_attention_functorch_grad 2025-12-04T09:43:32.0691762Z * [new branch] flex_flash -> origin/flex_flash 2025-12-04T09:43:32.0694265Z * [new branch] fmassa/fix_memeff_sharding_rule -> origin/fmassa/fix_memeff_sharding_rule 2025-12-04T09:43:32.0696069Z * [new branch] fmassa/tests_comm_compute_scheduler -> origin/fmassa/tests_comm_compute_scheduler 2025-12-04T09:43:32.0697751Z * [new branch] forkserver_fix -> origin/forkserver_fix 2025-12-04T09:43:32.0699591Z * [new branch] fsdp2_trace_rules -> origin/fsdp2_trace_rules 2025-12-04T09:43:32.0701529Z * [new branch] fx_cpp -> origin/fx_cpp 2025-12-04T09:43:32.0703917Z * [new branch] fy/fix-win -> origin/fy/fix-win 2025-12-04T09:43:32.0705827Z * [new branch] galv-patch-1 -> origin/galv-patch-1 2025-12-04T09:43:32.0708564Z * [new branch] galv/cudagraphs-conditional-nodes-4 -> origin/galv/cudagraphs-conditional-nodes-4 2025-12-04T09:43:32.0711067Z * [new branch] georgehong/cmakelists-patch -> origin/georgehong/cmakelists-patch 2025-12-04T09:43:32.0714436Z * [new branch] gh/AlnisM/1/base -> origin/gh/AlnisM/1/base 2025-12-04T09:43:32.0716246Z * [new branch] gh/AlnisM/1/head -> origin/gh/AlnisM/1/head 2025-12-04T09:43:32.0719112Z * [new branch] 
gh/EikanWang/67/base -> origin/gh/EikanWang/67/base 2025-12-04T09:43:32.0720827Z * [new branch] gh/EikanWang/67/head -> origin/gh/EikanWang/67/head 2025-12-04T09:43:32.0723875Z * [new branch] gh/Gasoonjia/1/base -> origin/gh/Gasoonjia/1/base 2025-12-04T09:43:32.0725637Z * [new branch] gh/Gasoonjia/1/head -> origin/gh/Gasoonjia/1/head 2025-12-04T09:43:32.0728516Z * [new branch] gh/H-Huang/131/base -> origin/gh/H-Huang/131/base 2025-12-04T09:43:32.0730225Z * [new branch] gh/H-Huang/131/head -> origin/gh/H-Huang/131/head 2025-12-04T09:43:32.0732019Z * [new branch] gh/H-Huang/131/orig -> origin/gh/H-Huang/131/orig 2025-12-04T09:43:32.0734473Z * [new branch] gh/H-Huang/132/base -> origin/gh/H-Huang/132/base 2025-12-04T09:43:32.0736214Z * [new branch] gh/H-Huang/132/head -> origin/gh/H-Huang/132/head 2025-12-04T09:43:32.0738001Z * [new branch] gh/H-Huang/132/orig -> origin/gh/H-Huang/132/orig 2025-12-04T09:43:32.0740566Z * [new branch] gh/H-Huang/180/base -> origin/gh/H-Huang/180/base 2025-12-04T09:43:32.0742158Z * [new branch] gh/H-Huang/180/head -> origin/gh/H-Huang/180/head 2025-12-04T09:43:32.0743951Z * [new branch] gh/H-Huang/180/orig -> origin/gh/H-Huang/180/orig 2025-12-04T09:43:32.0746215Z * [new branch] gh/H-Huang/182/base -> origin/gh/H-Huang/182/base 2025-12-04T09:43:32.0748132Z * [new branch] gh/H-Huang/182/head -> origin/gh/H-Huang/182/head 2025-12-04T09:43:32.0749851Z * [new branch] gh/H-Huang/182/orig -> origin/gh/H-Huang/182/orig 2025-12-04T09:43:32.0752388Z * [new branch] gh/H-Huang/226/base -> origin/gh/H-Huang/226/base 2025-12-04T09:43:32.0754127Z * [new branch] gh/H-Huang/226/head -> origin/gh/H-Huang/226/head 2025-12-04T09:43:32.0756017Z * [new branch] gh/H-Huang/226/orig -> origin/gh/H-Huang/226/orig 2025-12-04T09:43:32.0758610Z * [new branch] gh/H-Huang/228/base -> origin/gh/H-Huang/228/base 2025-12-04T09:43:32.0760347Z * [new branch] gh/H-Huang/228/head -> origin/gh/H-Huang/228/head 2025-12-04T09:43:32.0762080Z * [new branch] gh/H-Huang/228/orig 
-> origin/gh/H-Huang/228/orig 2025-12-04T09:43:32.0764999Z * [new branch] gh/IvanKobzarev/150/base -> origin/gh/IvanKobzarev/150/base 2025-12-04T09:43:32.0766731Z * [new branch] gh/IvanKobzarev/150/head -> origin/gh/IvanKobzarev/150/head 2025-12-04T09:43:32.0768415Z * [new branch] gh/IvanKobzarev/150/orig -> origin/gh/IvanKobzarev/150/orig 2025-12-04T09:43:32.0770820Z * [new branch] gh/IvanKobzarev/157/base -> origin/gh/IvanKobzarev/157/base 2025-12-04T09:43:32.0772671Z * [new branch] gh/IvanKobzarev/157/head -> origin/gh/IvanKobzarev/157/head 2025-12-04T09:43:32.0774418Z * [new branch] gh/IvanKobzarev/157/orig -> origin/gh/IvanKobzarev/157/orig 2025-12-04T09:43:32.0776865Z * [new branch] gh/IvanKobzarev/159/base -> origin/gh/IvanKobzarev/159/base 2025-12-04T09:43:32.0778649Z * [new branch] gh/IvanKobzarev/159/head -> origin/gh/IvanKobzarev/159/head 2025-12-04T09:43:32.0780522Z * [new branch] gh/IvanKobzarev/159/orig -> origin/gh/IvanKobzarev/159/orig 2025-12-04T09:43:32.0782935Z * [new branch] gh/IvanKobzarev/162/base -> origin/gh/IvanKobzarev/162/base 2025-12-04T09:43:32.0784790Z * [new branch] gh/IvanKobzarev/162/head -> origin/gh/IvanKobzarev/162/head 2025-12-04T09:43:32.0786546Z * [new branch] gh/IvanKobzarev/162/orig -> origin/gh/IvanKobzarev/162/orig 2025-12-04T09:43:32.0789101Z * [new branch] gh/IvanKobzarev/163/base -> origin/gh/IvanKobzarev/163/base 2025-12-04T09:43:32.0790748Z * [new branch] gh/IvanKobzarev/163/head -> origin/gh/IvanKobzarev/163/head 2025-12-04T09:43:32.0792455Z * [new branch] gh/IvanKobzarev/163/orig -> origin/gh/IvanKobzarev/163/orig 2025-12-04T09:43:32.0794866Z * [new branch] gh/IvanKobzarev/166/base -> origin/gh/IvanKobzarev/166/base 2025-12-04T09:43:32.0796660Z * [new branch] gh/IvanKobzarev/166/head -> origin/gh/IvanKobzarev/166/head 2025-12-04T09:43:32.0798374Z * [new branch] gh/IvanKobzarev/166/orig -> origin/gh/IvanKobzarev/166/orig 2025-12-04T09:43:32.0800814Z * [new branch] gh/IvanKobzarev/167/base -> 
origin/gh/IvanKobzarev/167/base 2025-12-04T09:43:32.0802501Z * [new branch] gh/IvanKobzarev/167/head -> origin/gh/IvanKobzarev/167/head 2025-12-04T09:43:32.0804291Z * [new branch] gh/IvanKobzarev/167/orig -> origin/gh/IvanKobzarev/167/orig 2025-12-04T09:43:32.0806646Z * [new branch] gh/IvanKobzarev/168/base -> origin/gh/IvanKobzarev/168/base 2025-12-04T09:43:32.0808526Z * [new branch] gh/IvanKobzarev/168/head -> origin/gh/IvanKobzarev/168/head 2025-12-04T09:43:32.0810165Z * [new branch] gh/IvanKobzarev/168/orig -> origin/gh/IvanKobzarev/168/orig 2025-12-04T09:43:32.0812541Z * [new branch] gh/IvanKobzarev/169/base -> origin/gh/IvanKobzarev/169/base 2025-12-04T09:43:32.0814323Z * [new branch] gh/IvanKobzarev/169/head -> origin/gh/IvanKobzarev/169/head 2025-12-04T09:43:32.0816081Z * [new branch] gh/IvanKobzarev/169/orig -> origin/gh/IvanKobzarev/169/orig 2025-12-04T09:43:32.0818354Z * [new branch] gh/IvanKobzarev/170/base -> origin/gh/IvanKobzarev/170/base 2025-12-04T09:43:32.0820100Z * [new branch] gh/IvanKobzarev/170/head -> origin/gh/IvanKobzarev/170/head 2025-12-04T09:43:32.0821815Z * [new branch] gh/IvanKobzarev/170/orig -> origin/gh/IvanKobzarev/170/orig 2025-12-04T09:43:32.0824379Z * [new branch] gh/IvanKobzarev/171/base -> origin/gh/IvanKobzarev/171/base 2025-12-04T09:43:32.0826117Z * [new branch] gh/IvanKobzarev/171/head -> origin/gh/IvanKobzarev/171/head 2025-12-04T09:43:32.0827961Z * [new branch] gh/IvanKobzarev/171/orig -> origin/gh/IvanKobzarev/171/orig 2025-12-04T09:43:32.0830417Z * [new branch] gh/IvanKobzarev/172/base -> origin/gh/IvanKobzarev/172/base 2025-12-04T09:43:32.0832302Z * [new branch] gh/IvanKobzarev/172/head -> origin/gh/IvanKobzarev/172/head 2025-12-04T09:43:32.0834004Z * [new branch] gh/IvanKobzarev/172/orig -> origin/gh/IvanKobzarev/172/orig 2025-12-04T09:43:32.0836410Z * [new branch] gh/IvanKobzarev/173/base -> origin/gh/IvanKobzarev/173/base 2025-12-04T09:43:32.0838171Z * [new branch] gh/IvanKobzarev/173/head -> 
origin/gh/IvanKobzarev/173/head 2025-12-04T09:43:32.0839896Z * [new branch] gh/IvanKobzarev/173/orig -> origin/gh/IvanKobzarev/173/orig 2025-12-04T09:43:32.0842365Z * [new branch] gh/IvanKobzarev/174/base -> origin/gh/IvanKobzarev/174/base 2025-12-04T09:43:32.0844229Z * [new branch] gh/IvanKobzarev/174/head -> origin/gh/IvanKobzarev/174/head 2025-12-04T09:43:32.0845993Z * [new branch] gh/IvanKobzarev/174/orig -> origin/gh/IvanKobzarev/174/orig 2025-12-04T09:43:32.0848445Z * [new branch] gh/IvanKobzarev/175/base -> origin/gh/IvanKobzarev/175/base 2025-12-04T09:43:32.0850273Z * [new branch] gh/IvanKobzarev/175/head -> origin/gh/IvanKobzarev/175/head 2025-12-04T09:43:32.0852165Z * [new branch] gh/IvanKobzarev/175/orig -> origin/gh/IvanKobzarev/175/orig 2025-12-04T09:43:32.0854686Z * [new branch] gh/IvanKobzarev/176/base -> origin/gh/IvanKobzarev/176/base 2025-12-04T09:43:32.0856711Z * [new branch] gh/IvanKobzarev/176/head -> origin/gh/IvanKobzarev/176/head 2025-12-04T09:43:32.0858425Z * [new branch] gh/IvanKobzarev/176/orig -> origin/gh/IvanKobzarev/176/orig 2025-12-04T09:43:32.0861127Z * [new branch] gh/IvanKobzarev/177/base -> origin/gh/IvanKobzarev/177/base 2025-12-04T09:43:32.0862933Z * [new branch] gh/IvanKobzarev/177/head -> origin/gh/IvanKobzarev/177/head 2025-12-04T09:43:32.0864670Z * [new branch] gh/IvanKobzarev/177/orig -> origin/gh/IvanKobzarev/177/orig 2025-12-04T09:43:32.0867254Z * [new branch] gh/IvanKobzarev/178/base -> origin/gh/IvanKobzarev/178/base 2025-12-04T09:43:32.0869100Z * [new branch] gh/IvanKobzarev/178/head -> origin/gh/IvanKobzarev/178/head 2025-12-04T09:43:32.0870915Z * [new branch] gh/IvanKobzarev/178/orig -> origin/gh/IvanKobzarev/178/orig 2025-12-04T09:43:32.0873328Z * [new branch] gh/IvanKobzarev/179/base -> origin/gh/IvanKobzarev/179/base 2025-12-04T09:43:32.0875020Z * [new branch] gh/IvanKobzarev/179/head -> origin/gh/IvanKobzarev/179/head 2025-12-04T09:43:32.0876942Z * [new branch] gh/IvanKobzarev/179/orig -> 
origin/gh/IvanKobzarev/179/orig 2025-12-04T09:43:32.0879251Z * [new branch] gh/IvanKobzarev/180/base -> origin/gh/IvanKobzarev/180/base 2025-12-04T09:43:32.0880997Z * [new branch] gh/IvanKobzarev/180/head -> origin/gh/IvanKobzarev/180/head 2025-12-04T09:43:32.0882783Z * [new branch] gh/IvanKobzarev/180/orig -> origin/gh/IvanKobzarev/180/orig 2025-12-04T09:43:32.0885450Z * [new branch] gh/IvanKobzarev/181/base -> origin/gh/IvanKobzarev/181/base 2025-12-04T09:43:32.0887171Z * [new branch] gh/IvanKobzarev/181/head -> origin/gh/IvanKobzarev/181/head 2025-12-04T09:43:32.0888897Z * [new branch] gh/IvanKobzarev/181/orig -> origin/gh/IvanKobzarev/181/orig 2025-12-04T09:43:32.0891524Z * [new branch] gh/IvanKobzarev/182/base -> origin/gh/IvanKobzarev/182/base 2025-12-04T09:43:32.0893299Z * [new branch] gh/IvanKobzarev/182/head -> origin/gh/IvanKobzarev/182/head 2025-12-04T09:43:32.0895078Z * [new branch] gh/IvanKobzarev/182/orig -> origin/gh/IvanKobzarev/182/orig 2025-12-04T09:43:32.0897591Z * [new branch] gh/IvanKobzarev/183/base -> origin/gh/IvanKobzarev/183/base 2025-12-04T09:43:32.0899388Z * [new branch] gh/IvanKobzarev/183/head -> origin/gh/IvanKobzarev/183/head 2025-12-04T09:43:32.0901279Z * [new branch] gh/IvanKobzarev/183/orig -> origin/gh/IvanKobzarev/183/orig 2025-12-04T09:43:32.0903677Z * [new branch] gh/IvanKobzarev/184/base -> origin/gh/IvanKobzarev/184/base 2025-12-04T09:43:32.0905497Z * [new branch] gh/IvanKobzarev/184/head -> origin/gh/IvanKobzarev/184/head 2025-12-04T09:43:32.0907337Z * [new branch] gh/IvanKobzarev/184/orig -> origin/gh/IvanKobzarev/184/orig 2025-12-04T09:43:32.0910296Z * [new branch] gh/NikhilAPatel/1/base -> origin/gh/NikhilAPatel/1/base 2025-12-04T09:43:32.0912110Z * [new branch] gh/NikhilAPatel/1/head -> origin/gh/NikhilAPatel/1/head 2025-12-04T09:43:32.0914301Z * [new branch] gh/NikhilAPatel/2/base -> origin/gh/NikhilAPatel/2/base 2025-12-04T09:43:32.0916030Z * [new branch] gh/NikhilAPatel/2/head -> origin/gh/NikhilAPatel/2/head 
2025-12-04T09:43:32.0918577Z * [new branch] gh/NikhilAPatel/4/base -> origin/gh/NikhilAPatel/4/base 2025-12-04T09:43:32.0920433Z * [new branch] gh/NikhilAPatel/4/head -> origin/gh/NikhilAPatel/4/head 2025-12-04T09:43:32.0922847Z * [new branch] gh/NikhilAPatel/5/base -> origin/gh/NikhilAPatel/5/base 2025-12-04T09:43:32.0924630Z * [new branch] gh/NikhilAPatel/5/head -> origin/gh/NikhilAPatel/5/head 2025-12-04T09:43:32.0926420Z * [new branch] gh/NikhilAPatel/5/orig -> origin/gh/NikhilAPatel/5/orig 2025-12-04T09:43:32.0929192Z * [new branch] gh/PaliC/17/base -> origin/gh/PaliC/17/base 2025-12-04T09:43:32.0931060Z * [new branch] gh/PaliC/17/head -> origin/gh/PaliC/17/head 2025-12-04T09:43:32.0932808Z * [new branch] gh/PaliC/17/orig -> origin/gh/PaliC/17/orig 2025-12-04T09:43:32.0935242Z * [new branch] gh/PaliC/18/base -> origin/gh/PaliC/18/base 2025-12-04T09:43:32.0937018Z * [new branch] gh/PaliC/18/head -> origin/gh/PaliC/18/head 2025-12-04T09:43:32.0938875Z * [new branch] gh/PaliC/18/orig -> origin/gh/PaliC/18/orig 2025-12-04T09:43:32.0941233Z * [new branch] gh/PaliC/20/base -> origin/gh/PaliC/20/base 2025-12-04T09:43:32.0942976Z * [new branch] gh/PaliC/20/head -> origin/gh/PaliC/20/head 2025-12-04T09:43:32.0944734Z * [new branch] gh/PaliC/20/orig -> origin/gh/PaliC/20/orig 2025-12-04T09:43:32.0947274Z * [new branch] gh/PaliC/21/base -> origin/gh/PaliC/21/base 2025-12-04T09:43:32.0949292Z * [new branch] gh/PaliC/21/head -> origin/gh/PaliC/21/head 2025-12-04T09:43:32.0950968Z * [new branch] gh/PaliC/21/orig -> origin/gh/PaliC/21/orig 2025-12-04T09:43:32.0953241Z * [new branch] gh/PaliC/23/base -> origin/gh/PaliC/23/base 2025-12-04T09:43:32.0954937Z * [new branch] gh/PaliC/23/head -> origin/gh/PaliC/23/head 2025-12-04T09:43:32.0957064Z * [new branch] gh/PaliC/23/orig -> origin/gh/PaliC/23/orig 2025-12-04T09:43:32.0959403Z * [new branch] gh/PaliC/24/base -> origin/gh/PaliC/24/base 2025-12-04T09:43:32.0961140Z * [new branch] gh/PaliC/24/head -> origin/gh/PaliC/24/head 
2025-12-04T09:43:32.0962863Z * [new branch] gh/PaliC/24/orig -> origin/gh/PaliC/24/orig 2025-12-04T09:43:32.0977357Z * [new branch] gh/PaliC/25/head -> origin/gh/PaliC/25/head 2025-12-04T09:43:32.0977802Z * [new branch] gh/PaliC/25/next -> origin/gh/PaliC/25/next 2025-12-04T09:43:32.0978297Z * [new branch] gh/PaliC/25/orig -> origin/gh/PaliC/25/orig 2025-12-04T09:43:32.0978965Z * [new branch] gh/PaliC/26/head -> origin/gh/PaliC/26/head 2025-12-04T09:43:32.0979493Z * [new branch] gh/PaliC/26/next -> origin/gh/PaliC/26/next 2025-12-04T09:43:32.0979865Z * [new branch] gh/PaliC/26/orig -> origin/gh/PaliC/26/orig 2025-12-04T09:43:32.0980340Z * [new branch] gh/PaliC/27/next -> origin/gh/PaliC/27/next 2025-12-04T09:43:32.0980985Z * [new branch] gh/PaliC/28/head -> origin/gh/PaliC/28/head 2025-12-04T09:43:32.0981358Z * [new branch] gh/PaliC/28/next -> origin/gh/PaliC/28/next 2025-12-04T09:43:32.0982772Z * [new branch] gh/PaliC/28/orig -> origin/gh/PaliC/28/orig 2025-12-04T09:43:32.0985325Z * [new branch] gh/PaliC/29/head -> origin/gh/PaliC/29/head 2025-12-04T09:43:32.0986728Z * [new branch] gh/PaliC/29/next -> origin/gh/PaliC/29/next 2025-12-04T09:43:32.0988844Z * [new branch] gh/PaliC/29/orig -> origin/gh/PaliC/29/orig 2025-12-04T09:43:32.0991194Z * [new branch] gh/PaliC/30/head -> origin/gh/PaliC/30/head 2025-12-04T09:43:32.0992572Z * [new branch] gh/PaliC/30/next -> origin/gh/PaliC/30/next 2025-12-04T09:43:32.0994514Z * [new branch] gh/PaliC/30/orig -> origin/gh/PaliC/30/orig 2025-12-04T09:43:32.0996763Z * [new branch] gh/PaliC/31/head -> origin/gh/PaliC/31/head 2025-12-04T09:43:32.0998517Z * [new branch] gh/PaliC/31/next -> origin/gh/PaliC/31/next 2025-12-04T09:43:32.1000077Z * [new branch] gh/PaliC/31/orig -> origin/gh/PaliC/31/orig 2025-12-04T09:43:32.1002930Z * [new branch] gh/PaulZhang12/25/base -> origin/gh/PaulZhang12/25/base 2025-12-04T09:43:32.1004978Z * [new branch] gh/PaulZhang12/25/head -> origin/gh/PaulZhang12/25/head 2025-12-04T09:43:32.1006755Z * [new 
branch] gh/PaulZhang12/25/orig -> origin/gh/PaulZhang12/25/orig 2025-12-04T09:43:32.1009103Z * [new branch] gh/PaulZhang12/28/base -> origin/gh/PaulZhang12/28/base 2025-12-04T09:43:32.1010938Z * [new branch] gh/PaulZhang12/28/head -> origin/gh/PaulZhang12/28/head 2025-12-04T09:43:32.1012675Z * [new branch] gh/PaulZhang12/28/orig -> origin/gh/PaulZhang12/28/orig 2025-12-04T09:43:32.1015386Z * [new branch] gh/PaulZhang12/31/base -> origin/gh/PaulZhang12/31/base 2025-12-04T09:43:32.1018590Z * [new branch] gh/PaulZhang12/31/head -> origin/gh/PaulZhang12/31/head 2025-12-04T09:43:32.1019461Z * [new branch] gh/PaulZhang12/31/orig -> origin/gh/PaulZhang12/31/orig 2025-12-04T09:43:32.1020890Z * [new branch] gh/PaulZhang12/37/base -> origin/gh/PaulZhang12/37/base 2025-12-04T09:43:32.1022811Z * [new branch] gh/PaulZhang12/37/head -> origin/gh/PaulZhang12/37/head 2025-12-04T09:43:32.1024544Z * [new branch] gh/PaulZhang12/37/orig -> origin/gh/PaulZhang12/37/orig 2025-12-04T09:43:32.1026969Z * [new branch] gh/PaulZhang12/40/base -> origin/gh/PaulZhang12/40/base 2025-12-04T09:43:32.1028851Z * [new branch] gh/PaulZhang12/40/head -> origin/gh/PaulZhang12/40/head 2025-12-04T09:43:32.1030660Z * [new branch] gh/PaulZhang12/40/orig -> origin/gh/PaulZhang12/40/orig 2025-12-04T09:43:32.1033102Z * [new branch] gh/PaulZhang12/42/base -> origin/gh/PaulZhang12/42/base 2025-12-04T09:43:32.1034669Z * [new branch] gh/PaulZhang12/42/head -> origin/gh/PaulZhang12/42/head 2025-12-04T09:43:32.1037158Z * [new branch] gh/PaulZhang12/43/base -> origin/gh/PaulZhang12/43/base 2025-12-04T09:43:32.1038897Z * [new branch] gh/PaulZhang12/43/head -> origin/gh/PaulZhang12/43/head 2025-12-04T09:43:32.1040737Z * [new branch] gh/PaulZhang12/43/orig -> origin/gh/PaulZhang12/43/orig 2025-12-04T09:43:32.1042932Z * [new branch] gh/PaulZhang12/44/base -> origin/gh/PaulZhang12/44/base 2025-12-04T09:43:32.1044741Z * [new branch] gh/PaulZhang12/44/head -> origin/gh/PaulZhang12/44/head 2025-12-04T09:43:32.1047154Z * [new 
branch] gh/PaulZhang12/45/base -> origin/gh/PaulZhang12/45/base 2025-12-04T09:43:32.1048940Z * [new branch] gh/PaulZhang12/45/head -> origin/gh/PaulZhang12/45/head 2025-12-04T09:43:32.1050521Z * [new branch] gh/PaulZhang12/45/orig -> origin/gh/PaulZhang12/45/orig 2025-12-04T09:43:32.1052906Z * [new branch] gh/PaulZhang12/46/base -> origin/gh/PaulZhang12/46/base 2025-12-04T09:43:32.1054730Z * [new branch] gh/PaulZhang12/46/head -> origin/gh/PaulZhang12/46/head 2025-12-04T09:43:32.1057833Z * [new branch] gh/PaulZhang12/46/orig -> origin/gh/PaulZhang12/46/orig 2025-12-04T09:43:32.1060271Z * [new branch] gh/PaulZhang12/47/base -> origin/gh/PaulZhang12/47/base 2025-12-04T09:43:32.1062119Z * [new branch] gh/PaulZhang12/47/head -> origin/gh/PaulZhang12/47/head 2025-12-04T09:43:32.1063864Z * [new branch] gh/PaulZhang12/47/orig -> origin/gh/PaulZhang12/47/orig 2025-12-04T09:43:32.1066068Z * [new branch] gh/PaulZhang12/48/base -> origin/gh/PaulZhang12/48/base 2025-12-04T09:43:32.1067908Z * [new branch] gh/PaulZhang12/48/head -> origin/gh/PaulZhang12/48/head 2025-12-04T09:43:32.1069651Z * [new branch] gh/PaulZhang12/48/orig -> origin/gh/PaulZhang12/48/orig 2025-12-04T09:43:32.1072539Z * [new branch] gh/SamGinzburg/11/base -> origin/gh/SamGinzburg/11/base 2025-12-04T09:43:32.1074304Z * [new branch] gh/SamGinzburg/11/head -> origin/gh/SamGinzburg/11/head 2025-12-04T09:43:32.1077149Z * [new branch] gh/SherlockNoMad/1/base -> origin/gh/SherlockNoMad/1/base 2025-12-04T09:43:32.1079017Z * [new branch] gh/SherlockNoMad/1/head -> origin/gh/SherlockNoMad/1/head 2025-12-04T09:43:32.1081380Z * [new branch] gh/SherlockNoMad/10/base -> origin/gh/SherlockNoMad/10/base 2025-12-04T09:43:32.1083221Z * [new branch] gh/SherlockNoMad/10/head -> origin/gh/SherlockNoMad/10/head 2025-12-04T09:43:32.1084987Z * [new branch] gh/SherlockNoMad/10/orig -> origin/gh/SherlockNoMad/10/orig 2025-12-04T09:43:32.1087251Z * [new branch] gh/SherlockNoMad/11/base -> origin/gh/SherlockNoMad/11/base 
2025-12-04T09:43:32.1089026Z * [new branch] gh/SherlockNoMad/11/head -> origin/gh/SherlockNoMad/11/head 2025-12-04T09:43:32.1090844Z * [new branch] gh/SherlockNoMad/11/orig -> origin/gh/SherlockNoMad/11/orig 2025-12-04T09:43:32.1092981Z * [new branch] gh/SherlockNoMad/12/base -> origin/gh/SherlockNoMad/12/base 2025-12-04T09:43:32.1094609Z * [new branch] gh/SherlockNoMad/12/head -> origin/gh/SherlockNoMad/12/head 2025-12-04T09:43:32.1096514Z * [new branch] gh/SherlockNoMad/12/orig -> origin/gh/SherlockNoMad/12/orig 2025-12-04T09:43:32.1098868Z * [new branch] gh/SherlockNoMad/15/base -> origin/gh/SherlockNoMad/15/base 2025-12-04T09:43:32.1100479Z * [new branch] gh/SherlockNoMad/15/head -> origin/gh/SherlockNoMad/15/head 2025-12-04T09:43:32.1102353Z * [new branch] gh/SherlockNoMad/15/orig -> origin/gh/SherlockNoMad/15/orig 2025-12-04T09:43:32.1104656Z * [new branch] gh/SherlockNoMad/17/base -> origin/gh/SherlockNoMad/17/base 2025-12-04T09:43:32.1106416Z * [new branch] gh/SherlockNoMad/17/head -> origin/gh/SherlockNoMad/17/head 2025-12-04T09:43:32.1108219Z * [new branch] gh/SherlockNoMad/17/orig -> origin/gh/SherlockNoMad/17/orig 2025-12-04T09:43:32.1110965Z * [new branch] gh/SherlockNoMad/18/base -> origin/gh/SherlockNoMad/18/base 2025-12-04T09:43:32.1112736Z * [new branch] gh/SherlockNoMad/18/head -> origin/gh/SherlockNoMad/18/head 2025-12-04T09:43:32.1114568Z * [new branch] gh/SherlockNoMad/18/orig -> origin/gh/SherlockNoMad/18/orig 2025-12-04T09:43:32.1116703Z * [new branch] gh/SherlockNoMad/19/base -> origin/gh/SherlockNoMad/19/base 2025-12-04T09:43:32.1118534Z * [new branch] gh/SherlockNoMad/19/head -> origin/gh/SherlockNoMad/19/head 2025-12-04T09:43:32.1120308Z * [new branch] gh/SherlockNoMad/19/orig -> origin/gh/SherlockNoMad/19/orig 2025-12-04T09:43:32.1122518Z * [new branch] gh/SherlockNoMad/2/base -> origin/gh/SherlockNoMad/2/base 2025-12-04T09:43:32.1124278Z * [new branch] gh/SherlockNoMad/2/head -> origin/gh/SherlockNoMad/2/head 
2025-12-04T09:43:32.1126404Z * [new branch] gh/SherlockNoMad/20/base -> origin/gh/SherlockNoMad/20/base 2025-12-04T09:43:32.1128332Z * [new branch] gh/SherlockNoMad/20/head -> origin/gh/SherlockNoMad/20/head 2025-12-04T09:43:32.1129889Z * [new branch] gh/SherlockNoMad/20/orig -> origin/gh/SherlockNoMad/20/orig 2025-12-04T09:43:32.1132511Z * [new branch] gh/SherlockNoMad/21/base -> origin/gh/SherlockNoMad/21/base 2025-12-04T09:43:32.1134403Z * [new branch] gh/SherlockNoMad/21/head -> origin/gh/SherlockNoMad/21/head 2025-12-04T09:43:32.1135857Z * [new branch] gh/SherlockNoMad/21/orig -> origin/gh/SherlockNoMad/21/orig 2025-12-04T09:43:32.1138341Z * [new branch] gh/SherlockNoMad/3/base -> origin/gh/SherlockNoMad/3/base 2025-12-04T09:43:32.1139982Z * [new branch] gh/SherlockNoMad/3/head -> origin/gh/SherlockNoMad/3/head 2025-12-04T09:43:32.1142291Z * [new branch] gh/SherlockNoMad/4/base -> origin/gh/SherlockNoMad/4/base 2025-12-04T09:43:32.1144001Z * [new branch] gh/SherlockNoMad/4/head -> origin/gh/SherlockNoMad/4/head 2025-12-04T09:43:32.1146243Z * [new branch] gh/SherlockNoMad/5/base -> origin/gh/SherlockNoMad/5/base 2025-12-04T09:43:32.1148115Z * [new branch] gh/SherlockNoMad/5/head -> origin/gh/SherlockNoMad/5/head 2025-12-04T09:43:32.1151455Z * [new branch] gh/Sidharth123-cpu/24/base -> origin/gh/Sidharth123-cpu/24/base 2025-12-04T09:43:32.1153770Z * [new branch] gh/Sidharth123-cpu/25/base -> origin/gh/Sidharth123-cpu/25/base 2025-12-04T09:43:32.1156132Z * [new branch] gh/Sidharth123-cpu/26/base -> origin/gh/Sidharth123-cpu/26/base 2025-12-04T09:43:32.1158672Z * [new branch] gh/Sidharth123-cpu/27/base -> origin/gh/Sidharth123-cpu/27/base 2025-12-04T09:43:32.1161727Z * [new branch] gh/StrongerXi/1/base -> origin/gh/StrongerXi/1/base 2025-12-04T09:43:32.1163235Z * [new branch] gh/StrongerXi/1/head -> origin/gh/StrongerXi/1/head 2025-12-04T09:43:32.1165699Z * [new branch] gh/StrongerXi/71/base -> origin/gh/StrongerXi/71/base 2025-12-04T09:43:32.1167506Z * [new 
branch] gh/StrongerXi/71/head -> origin/gh/StrongerXi/71/head 2025-12-04T09:43:32.1169834Z * [new branch] gh/StrongerXi/72/base -> origin/gh/StrongerXi/72/base 2025-12-04T09:43:32.1171404Z * [new branch] gh/StrongerXi/72/head -> origin/gh/StrongerXi/72/head 2025-12-04T09:43:32.1173869Z * [new branch] gh/StrongerXi/73/base -> origin/gh/StrongerXi/73/base 2025-12-04T09:43:32.1175629Z * [new branch] gh/StrongerXi/73/head -> origin/gh/StrongerXi/73/head 2025-12-04T09:43:32.1177432Z * [new branch] gh/StrongerXi/73/orig -> origin/gh/StrongerXi/73/orig 2025-12-04T09:43:32.1180472Z * [new branch] gh/XilunWu/160/base -> origin/gh/XilunWu/160/base 2025-12-04T09:43:32.1182077Z * [new branch] gh/XilunWu/160/head -> origin/gh/XilunWu/160/head 2025-12-04T09:43:32.1183642Z * [new branch] gh/XilunWu/160/orig -> origin/gh/XilunWu/160/orig 2025-12-04T09:43:32.1185928Z * [new branch] gh/XilunWu/163/base -> origin/gh/XilunWu/163/base 2025-12-04T09:43:32.1187843Z * [new branch] gh/XilunWu/163/head -> origin/gh/XilunWu/163/head 2025-12-04T09:43:32.1189662Z * [new branch] gh/XilunWu/163/orig -> origin/gh/XilunWu/163/orig 2025-12-04T09:43:32.1192108Z * [new branch] gh/XilunWu/168/base -> origin/gh/XilunWu/168/base 2025-12-04T09:43:32.1193834Z * [new branch] gh/XilunWu/168/head -> origin/gh/XilunWu/168/head 2025-12-04T09:43:32.1195604Z * [new branch] gh/XilunWu/168/orig -> origin/gh/XilunWu/168/orig 2025-12-04T09:43:32.1197974Z * [new branch] gh/XilunWu/169/base -> origin/gh/XilunWu/169/base 2025-12-04T09:43:32.1199801Z * [new branch] gh/XilunWu/169/head -> origin/gh/XilunWu/169/head 2025-12-04T09:43:32.1201550Z * [new branch] gh/XilunWu/169/orig -> origin/gh/XilunWu/169/orig 2025-12-04T09:43:32.1203766Z * [new branch] gh/XilunWu/170/base -> origin/gh/XilunWu/170/base 2025-12-04T09:43:32.1205518Z * [new branch] gh/XilunWu/170/head -> origin/gh/XilunWu/170/head 2025-12-04T09:43:32.1207254Z * [new branch] gh/XilunWu/170/orig -> origin/gh/XilunWu/170/orig 2025-12-04T09:43:32.1209812Z * [new 
branch] gh/XilunWu/171/base -> origin/gh/XilunWu/171/base 2025-12-04T09:43:32.1211635Z * [new branch] gh/XilunWu/171/head -> origin/gh/XilunWu/171/head 2025-12-04T09:43:32.1213424Z * [new branch] gh/XilunWu/171/orig -> origin/gh/XilunWu/171/orig 2025-12-04T09:43:32.1215788Z * [new branch] gh/XilunWu/173/base -> origin/gh/XilunWu/173/base 2025-12-04T09:43:32.1217648Z * [new branch] gh/XilunWu/173/head -> origin/gh/XilunWu/173/head 2025-12-04T09:43:32.1219397Z * [new branch] gh/XilunWu/173/orig -> origin/gh/XilunWu/173/orig 2025-12-04T09:43:32.1221736Z * [new branch] gh/XilunWu/175/base -> origin/gh/XilunWu/175/base 2025-12-04T09:43:32.1223542Z * [new branch] gh/XilunWu/175/head -> origin/gh/XilunWu/175/head 2025-12-04T09:43:32.1225831Z * [new branch] gh/XilunWu/175/orig -> origin/gh/XilunWu/175/orig 2025-12-04T09:43:32.1228504Z * [new branch] gh/XilunWu/176/base -> origin/gh/XilunWu/176/base 2025-12-04T09:43:32.1230236Z * [new branch] gh/XilunWu/176/head -> origin/gh/XilunWu/176/head 2025-12-04T09:43:32.1232213Z * [new branch] gh/XilunWu/176/orig -> origin/gh/XilunWu/176/orig 2025-12-04T09:43:32.1234973Z * [new branch] gh/XuehaiPan/14/base -> origin/gh/XuehaiPan/14/base 2025-12-04T09:43:32.1236687Z * [new branch] gh/XuehaiPan/14/head -> origin/gh/XuehaiPan/14/head 2025-12-04T09:43:32.1238468Z * [new branch] gh/XuehaiPan/14/orig -> origin/gh/XuehaiPan/14/orig 2025-12-04T09:43:32.1240889Z * [new branch] gh/XuehaiPan/179/base -> origin/gh/XuehaiPan/179/base 2025-12-04T09:43:32.1242661Z * [new branch] gh/XuehaiPan/179/head -> origin/gh/XuehaiPan/179/head 2025-12-04T09:43:32.1244487Z * [new branch] gh/XuehaiPan/179/orig -> origin/gh/XuehaiPan/179/orig 2025-12-04T09:43:32.1246842Z * [new branch] gh/XuehaiPan/249/base -> origin/gh/XuehaiPan/249/base 2025-12-04T09:43:32.1248598Z * [new branch] gh/XuehaiPan/249/head -> origin/gh/XuehaiPan/249/head 2025-12-04T09:43:32.1250399Z * [new branch] gh/XuehaiPan/249/orig -> origin/gh/XuehaiPan/249/orig 2025-12-04T09:43:32.1252722Z * 
[new branch] gh/XuehaiPan/253/base -> origin/gh/XuehaiPan/253/base 2025-12-04T09:43:32.1254487Z * [new branch] gh/XuehaiPan/253/head -> origin/gh/XuehaiPan/253/head 2025-12-04T09:43:32.1256542Z * [new branch] gh/XuehaiPan/253/orig -> origin/gh/XuehaiPan/253/orig 2025-12-04T09:43:32.1258841Z * [new branch] gh/XuehaiPan/254/base -> origin/gh/XuehaiPan/254/base 2025-12-04T09:43:32.1260582Z * [new branch] gh/XuehaiPan/254/head -> origin/gh/XuehaiPan/254/head 2025-12-04T09:43:32.1262412Z * [new branch] gh/XuehaiPan/254/orig -> origin/gh/XuehaiPan/254/orig 2025-12-04T09:43:32.1264707Z * [new branch] gh/XuehaiPan/255/base -> origin/gh/XuehaiPan/255/base 2025-12-04T09:43:32.1266430Z * [new branch] gh/XuehaiPan/255/head -> origin/gh/XuehaiPan/255/head 2025-12-04T09:43:32.1268326Z * [new branch] gh/XuehaiPan/255/orig -> origin/gh/XuehaiPan/255/orig 2025-12-04T09:43:32.1270783Z * [new branch] gh/XuehaiPan/271/base -> origin/gh/XuehaiPan/271/base 2025-12-04T09:43:32.1272509Z * [new branch] gh/XuehaiPan/271/head -> origin/gh/XuehaiPan/271/head 2025-12-04T09:43:32.1274274Z * [new branch] gh/XuehaiPan/271/orig -> origin/gh/XuehaiPan/271/orig 2025-12-04T09:43:32.1276597Z * [new branch] gh/XuehaiPan/343/base -> origin/gh/XuehaiPan/343/base 2025-12-04T09:43:32.1278340Z * [new branch] gh/XuehaiPan/343/head -> origin/gh/XuehaiPan/343/head 2025-12-04T09:43:32.1280006Z * [new branch] gh/XuehaiPan/343/orig -> origin/gh/XuehaiPan/343/orig 2025-12-04T09:43:32.1282481Z * [new branch] gh/XuehaiPan/347/base -> origin/gh/XuehaiPan/347/base 2025-12-04T09:43:32.1284269Z * [new branch] gh/XuehaiPan/347/head -> origin/gh/XuehaiPan/347/head 2025-12-04T09:43:32.1286000Z * [new branch] gh/XuehaiPan/347/orig -> origin/gh/XuehaiPan/347/orig 2025-12-04T09:43:32.1288331Z * [new branch] gh/XuehaiPan/348/base -> origin/gh/XuehaiPan/348/base 2025-12-04T09:43:32.1290080Z * [new branch] gh/XuehaiPan/348/head -> origin/gh/XuehaiPan/348/head 2025-12-04T09:43:32.1291814Z * [new branch] gh/XuehaiPan/348/orig -> 
origin/gh/XuehaiPan/348/orig 2025-12-04T09:43:32.1294167Z * [new branch] gh/XuehaiPan/350/base -> origin/gh/XuehaiPan/350/base 2025-12-04T09:43:32.1295873Z * [new branch] gh/XuehaiPan/350/head -> origin/gh/XuehaiPan/350/head 2025-12-04T09:43:32.1297701Z * [new branch] gh/XuehaiPan/350/orig -> origin/gh/XuehaiPan/350/orig 2025-12-04T09:43:32.1300198Z * [new branch] gh/XuehaiPan/365/base -> origin/gh/XuehaiPan/365/base 2025-12-04T09:43:32.1301939Z * [new branch] gh/XuehaiPan/365/head -> origin/gh/XuehaiPan/365/head 2025-12-04T09:43:32.1303646Z * [new branch] gh/XuehaiPan/365/orig -> origin/gh/XuehaiPan/365/orig 2025-12-04T09:43:32.1306063Z * [new branch] gh/XuehaiPan/366/base -> origin/gh/XuehaiPan/366/base 2025-12-04T09:43:32.1307889Z * [new branch] gh/XuehaiPan/366/head -> origin/gh/XuehaiPan/366/head 2025-12-04T09:43:32.1310282Z * [new branch] gh/XuehaiPan/370/base -> origin/gh/XuehaiPan/370/base 2025-12-04T09:43:32.1311986Z * [new branch] gh/XuehaiPan/370/head -> origin/gh/XuehaiPan/370/head 2025-12-04T09:43:32.1313758Z * [new branch] gh/XuehaiPan/370/orig -> origin/gh/XuehaiPan/370/orig 2025-12-04T09:43:32.1316193Z * [new branch] gh/XuehaiPan/390/base -> origin/gh/XuehaiPan/390/base 2025-12-04T09:43:32.1318059Z * [new branch] gh/XuehaiPan/390/head -> origin/gh/XuehaiPan/390/head 2025-12-04T09:43:32.1319922Z * [new branch] gh/XuehaiPan/390/orig -> origin/gh/XuehaiPan/390/orig 2025-12-04T09:43:32.1322352Z * [new branch] gh/XuehaiPan/391/base -> origin/gh/XuehaiPan/391/base 2025-12-04T09:43:32.1324160Z * [new branch] gh/XuehaiPan/391/head -> origin/gh/XuehaiPan/391/head 2025-12-04T09:43:32.1325991Z * [new branch] gh/XuehaiPan/391/orig -> origin/gh/XuehaiPan/391/orig 2025-12-04T09:43:32.1328298Z * [new branch] gh/XuehaiPan/392/base -> origin/gh/XuehaiPan/392/base 2025-12-04T09:43:32.1330061Z * [new branch] gh/XuehaiPan/392/head -> origin/gh/XuehaiPan/392/head 2025-12-04T09:43:32.1331726Z * [new branch] gh/XuehaiPan/392/orig -> origin/gh/XuehaiPan/392/orig 
2025-12-04T09:43:32.1334596Z * [new branch] gh/XuehaiPan/394/base -> origin/gh/XuehaiPan/394/base 2025-12-04T09:43:32.1336384Z * [new branch] gh/XuehaiPan/394/head -> origin/gh/XuehaiPan/394/head 2025-12-04T09:43:32.1338107Z * [new branch] gh/XuehaiPan/394/orig -> origin/gh/XuehaiPan/394/orig 2025-12-04T09:43:32.1340557Z * [new branch] gh/XuehaiPan/397/base -> origin/gh/XuehaiPan/397/base 2025-12-04T09:43:32.1342703Z * [new branch] gh/XuehaiPan/397/head -> origin/gh/XuehaiPan/397/head 2025-12-04T09:43:32.1344478Z * [new branch] gh/XuehaiPan/397/orig -> origin/gh/XuehaiPan/397/orig 2025-12-04T09:43:32.1346961Z * [new branch] gh/XuehaiPan/398/base -> origin/gh/XuehaiPan/398/base 2025-12-04T09:43:32.1348843Z * [new branch] gh/XuehaiPan/398/head -> origin/gh/XuehaiPan/398/head 2025-12-04T09:43:32.1350628Z * [new branch] gh/XuehaiPan/398/orig -> origin/gh/XuehaiPan/398/orig 2025-12-04T09:43:32.1352946Z * [new branch] gh/XuehaiPan/399/base -> origin/gh/XuehaiPan/399/base 2025-12-04T09:43:32.1354725Z * [new branch] gh/XuehaiPan/399/head -> origin/gh/XuehaiPan/399/head 2025-12-04T09:43:32.1356873Z * [new branch] gh/XuehaiPan/399/orig -> origin/gh/XuehaiPan/399/orig 2025-12-04T09:43:32.1359355Z * [new branch] gh/XuehaiPan/400/base -> origin/gh/XuehaiPan/400/base 2025-12-04T09:43:32.1361099Z * [new branch] gh/XuehaiPan/400/head -> origin/gh/XuehaiPan/400/head 2025-12-04T09:43:32.1362888Z * [new branch] gh/XuehaiPan/400/orig -> origin/gh/XuehaiPan/400/orig 2025-12-04T09:43:32.1365803Z * [new branch] gh/ZhiweiYan-96/39/base -> origin/gh/ZhiweiYan-96/39/base 2025-12-04T09:43:32.1367521Z * [new branch] gh/ZhiweiYan-96/39/head -> origin/gh/ZhiweiYan-96/39/head 2025-12-04T09:43:32.1369384Z * [new branch] gh/ZhiweiYan-96/39/orig -> origin/gh/ZhiweiYan-96/39/orig 2025-12-04T09:43:32.1371872Z * [new branch] gh/ZhiweiYan-96/44/base -> origin/gh/ZhiweiYan-96/44/base 2025-12-04T09:43:32.1373524Z * [new branch] gh/ZhiweiYan-96/44/head -> origin/gh/ZhiweiYan-96/44/head 
2025-12-04T09:43:32.1375821Z * [new branch] gh/ZhiweiYan-96/45/base -> origin/gh/ZhiweiYan-96/45/base 2025-12-04T09:43:32.1377620Z * [new branch] gh/ZhiweiYan-96/45/head -> origin/gh/ZhiweiYan-96/45/head 2025-12-04T09:43:32.1380031Z * [new branch] gh/ZhiweiYan-96/49/base -> origin/gh/ZhiweiYan-96/49/base 2025-12-04T09:43:32.1381769Z * [new branch] gh/ZhiweiYan-96/49/head -> origin/gh/ZhiweiYan-96/49/head 2025-12-04T09:43:32.1384092Z * [new branch] gh/ZhiweiYan-96/62/base -> origin/gh/ZhiweiYan-96/62/base 2025-12-04T09:43:32.1385786Z * [new branch] gh/ZhiweiYan-96/62/head -> origin/gh/ZhiweiYan-96/62/head 2025-12-04T09:43:32.1388470Z * [new branch] gh/ZhiweiYan-96/66/base -> origin/gh/ZhiweiYan-96/66/base 2025-12-04T09:43:32.1390297Z * [new branch] gh/ZhiweiYan-96/66/head -> origin/gh/ZhiweiYan-96/66/head 2025-12-04T09:43:32.1392649Z * [new branch] gh/ZhiweiYan-96/67/base -> origin/gh/ZhiweiYan-96/67/base 2025-12-04T09:43:32.1394349Z * [new branch] gh/ZhiweiYan-96/67/head -> origin/gh/ZhiweiYan-96/67/head 2025-12-04T09:43:32.1396580Z * [new branch] gh/ZhiweiYan-96/68/base -> origin/gh/ZhiweiYan-96/68/base 2025-12-04T09:43:32.1398270Z * [new branch] gh/ZhiweiYan-96/68/head -> origin/gh/ZhiweiYan-96/68/head 2025-12-04T09:43:32.1400007Z * [new branch] gh/ZhiweiYan-96/68/orig -> origin/gh/ZhiweiYan-96/68/orig 2025-12-04T09:43:32.1402902Z * [new branch] gh/aakhundov/1/base -> origin/gh/aakhundov/1/base 2025-12-04T09:43:32.1404750Z * [new branch] gh/aakhundov/1/head -> origin/gh/aakhundov/1/head 2025-12-04T09:43:32.1407012Z * [new branch] gh/aakhundov/2/base -> origin/gh/aakhundov/2/base 2025-12-04T09:43:32.1408757Z * [new branch] gh/aakhundov/2/head -> origin/gh/aakhundov/2/head 2025-12-04T09:43:32.1411199Z * [new branch] gh/aditew01/openblas -> origin/gh/aditew01/openblas 2025-12-04T09:43:32.1412950Z * [new branch] gh/aditew01/sbgemm -> origin/gh/aditew01/sbgemm 2025-12-04T09:43:32.1414765Z * [new branch] gh/aditew01/vecbf16 -> origin/gh/aditew01/vecbf16 
2025-12-04T09:43:32.1417518Z * [new branch] gh/albanD/4/base -> origin/gh/albanD/4/base 2025-12-04T09:43:32.1419327Z * [new branch] gh/albanD/4/head -> origin/gh/albanD/4/head 2025-12-04T09:43:32.1421087Z * [new branch] gh/albanD/4/orig -> origin/gh/albanD/4/orig 2025-12-04T09:43:32.1423644Z * [new branch] gh/alexbrauckmann/paddedtensor_faketensor_init -> origin/gh/alexbrauckmann/paddedtensor_faketensor_init 2025-12-04T09:43:32.1426442Z * [new branch] gh/alexsamardzic/12/base -> origin/gh/alexsamardzic/12/base 2025-12-04T09:43:32.1428343Z * [new branch] gh/alexsamardzic/12/head -> origin/gh/alexsamardzic/12/head 2025-12-04T09:43:32.1430168Z * [new branch] gh/alexsamardzic/12/orig -> origin/gh/alexsamardzic/12/orig 2025-12-04T09:43:32.1432485Z * [new branch] gh/alexsamardzic/14/base -> origin/gh/alexsamardzic/14/base 2025-12-04T09:43:32.1434320Z * [new branch] gh/alexsamardzic/14/head -> origin/gh/alexsamardzic/14/head 2025-12-04T09:43:32.1436075Z * [new branch] gh/alexsamardzic/14/orig -> origin/gh/alexsamardzic/14/orig 2025-12-04T09:43:32.1438405Z * [new branch] gh/alexsamardzic/15/base -> origin/gh/alexsamardzic/15/base 2025-12-04T09:43:32.1440099Z * [new branch] gh/alexsamardzic/15/head -> origin/gh/alexsamardzic/15/head 2025-12-04T09:43:32.1442025Z * [new branch] gh/alexsamardzic/15/orig -> origin/gh/alexsamardzic/15/orig 2025-12-04T09:43:32.1444812Z * [new branch] gh/amjames/18/base -> origin/gh/amjames/18/base 2025-12-04T09:43:32.1446546Z * [new branch] gh/amjames/18/head -> origin/gh/amjames/18/head 2025-12-04T09:43:32.1448303Z * [new branch] gh/amjames/18/orig -> origin/gh/amjames/18/orig 2025-12-04T09:43:32.1451330Z * [new branch] gh/andrewor14/35/base -> origin/gh/andrewor14/35/base 2025-12-04T09:43:32.1453235Z * [new branch] gh/andrewor14/35/head -> origin/gh/andrewor14/35/head 2025-12-04T09:43:32.1455134Z * [new branch] gh/andrewor14/35/orig -> origin/gh/andrewor14/35/orig 2025-12-04T09:43:32.1457922Z * [new branch] gh/andrewor14/50/base -> 
origin/gh/andrewor14/50/base 2025-12-04T09:43:32.1459733Z * [new branch] gh/andrewor14/50/head -> origin/gh/andrewor14/50/head 2025-12-04T09:43:32.1461492Z * [new branch] gh/andrewor14/50/orig -> origin/gh/andrewor14/50/orig 2025-12-04T09:43:32.1464376Z * [new branch] gh/andyanwang/30/base -> origin/gh/andyanwang/30/base 2025-12-04T09:43:32.1466260Z * [new branch] gh/andyanwang/30/orig -> origin/gh/andyanwang/30/orig 2025-12-04T09:43:32.1468799Z * [new branch] gh/andyanwang/31/base -> origin/gh/andyanwang/31/base 2025-12-04T09:43:32.1470813Z * [new branch] gh/andyanwang/31/orig -> origin/gh/andyanwang/31/orig 2025-12-04T09:43:32.1473194Z * [new branch] gh/andyanwang/39/base -> origin/gh/andyanwang/39/base 2025-12-04T09:43:32.1474986Z * [new branch] gh/andyanwang/39/head -> origin/gh/andyanwang/39/head 2025-12-04T09:43:32.1476779Z * [new branch] gh/andyanwang/39/orig -> origin/gh/andyanwang/39/orig 2025-12-04T09:43:32.1479318Z * [new branch] gh/andyanwang/42/base -> origin/gh/andyanwang/42/base 2025-12-04T09:43:32.1481028Z * [new branch] gh/andyanwang/42/head -> origin/gh/andyanwang/42/head 2025-12-04T09:43:32.1482807Z * [new branch] gh/andyanwang/42/orig -> origin/gh/andyanwang/42/orig 2025-12-04T09:43:32.1485249Z * [new branch] gh/andyanwang/45/base -> origin/gh/andyanwang/45/base 2025-12-04T09:43:32.1487096Z * [new branch] gh/andyanwang/45/head -> origin/gh/andyanwang/45/head 2025-12-04T09:43:32.1489044Z * [new branch] gh/andyanwang/45/orig -> origin/gh/andyanwang/45/orig 2025-12-04T09:43:32.1491797Z * [new branch] gh/angelayi/107/base -> origin/gh/angelayi/107/base 2025-12-04T09:43:32.1493544Z * [new branch] gh/angelayi/107/head -> origin/gh/angelayi/107/head 2025-12-04T09:43:32.1495907Z * [new branch] gh/angelayi/114/base -> origin/gh/angelayi/114/base 2025-12-04T09:43:32.1497720Z * [new branch] gh/angelayi/114/head -> origin/gh/angelayi/114/head 2025-12-04T09:43:32.1499483Z * [new branch] gh/angelayi/114/orig -> origin/gh/angelayi/114/orig 
2025-12-04T09:43:32.1501825Z * [new branch] gh/angelayi/116/base -> origin/gh/angelayi/116/base 2025-12-04T09:43:32.1503629Z * [new branch] gh/angelayi/116/head -> origin/gh/angelayi/116/head 2025-12-04T09:43:32.1505387Z * [new branch] gh/angelayi/116/orig -> origin/gh/angelayi/116/orig 2025-12-04T09:43:32.1507901Z * [new branch] gh/angelayi/122/base -> origin/gh/angelayi/122/base 2025-12-04T09:43:32.1509681Z * [new branch] gh/angelayi/122/head -> origin/gh/angelayi/122/head 2025-12-04T09:43:32.1511414Z * [new branch] gh/angelayi/122/orig -> origin/gh/angelayi/122/orig 2025-12-04T09:43:32.1513900Z * [new branch] gh/angelayi/124/base -> origin/gh/angelayi/124/base 2025-12-04T09:43:32.1515764Z * [new branch] gh/angelayi/124/head -> origin/gh/angelayi/124/head 2025-12-04T09:43:32.1517405Z * [new branch] gh/angelayi/124/orig -> origin/gh/angelayi/124/orig 2025-12-04T09:43:32.1519900Z * [new branch] gh/angelayi/128/base -> origin/gh/angelayi/128/base 2025-12-04T09:43:32.1521699Z * [new branch] gh/angelayi/128/head -> origin/gh/angelayi/128/head 2025-12-04T09:43:32.1523449Z * [new branch] gh/angelayi/128/orig -> origin/gh/angelayi/128/orig 2025-12-04T09:43:32.1525843Z * [new branch] gh/angelayi/131/base -> origin/gh/angelayi/131/base 2025-12-04T09:43:32.1527596Z * [new branch] gh/angelayi/131/head -> origin/gh/angelayi/131/head 2025-12-04T09:43:32.1529331Z * [new branch] gh/angelayi/131/orig -> origin/gh/angelayi/131/orig 2025-12-04T09:43:32.1532305Z * [new branch] gh/angelayi/132/base -> origin/gh/angelayi/132/base 2025-12-04T09:43:32.1534001Z * [new branch] gh/angelayi/132/head -> origin/gh/angelayi/132/head 2025-12-04T09:43:32.1535871Z * [new branch] gh/angelayi/132/orig -> origin/gh/angelayi/132/orig 2025-12-04T09:43:32.1538157Z * [new branch] gh/angelayi/133/base -> origin/gh/angelayi/133/base 2025-12-04T09:43:32.1539937Z * [new branch] gh/angelayi/133/head -> origin/gh/angelayi/133/head 2025-12-04T09:43:32.1541711Z * [new branch] gh/angelayi/133/orig -> 
origin/gh/angelayi/133/orig 2025-12-04T09:43:32.1544228Z * [new branch] gh/angelayi/134/base -> origin/gh/angelayi/134/base 2025-12-04T09:43:32.1546085Z * [new branch] gh/angelayi/134/head -> origin/gh/angelayi/134/head 2025-12-04T09:43:32.1547984Z * [new branch] gh/angelayi/134/orig -> origin/gh/angelayi/134/orig 2025-12-04T09:43:32.1550518Z * [new branch] gh/angelayi/135/base -> origin/gh/angelayi/135/base 2025-12-04T09:43:32.1552294Z * [new branch] gh/angelayi/135/head -> origin/gh/angelayi/135/head 2025-12-04T09:43:32.1554041Z * [new branch] gh/angelayi/135/orig -> origin/gh/angelayi/135/orig 2025-12-04T09:43:32.1557347Z * [new branch] gh/angelayi/136/base -> origin/gh/angelayi/136/base 2025-12-04T09:43:32.1559102Z * [new branch] gh/angelayi/136/head -> origin/gh/angelayi/136/head 2025-12-04T09:43:32.1560788Z * [new branch] gh/angelayi/136/orig -> origin/gh/angelayi/136/orig 2025-12-04T09:43:32.1563345Z * [new branch] gh/angelayi/137/base -> origin/gh/angelayi/137/base 2025-12-04T09:43:32.1565042Z * [new branch] gh/angelayi/137/head -> origin/gh/angelayi/137/head 2025-12-04T09:43:32.1566951Z * [new branch] gh/angelayi/137/orig -> origin/gh/angelayi/137/orig 2025-12-04T09:43:32.1569264Z * [new branch] gh/angelayi/138/base -> origin/gh/angelayi/138/base 2025-12-04T09:43:32.1570970Z * [new branch] gh/angelayi/138/head -> origin/gh/angelayi/138/head 2025-12-04T09:43:32.1572717Z * [new branch] gh/angelayi/138/orig -> origin/gh/angelayi/138/orig 2025-12-04T09:43:32.1575058Z * [new branch] gh/angelayi/139/base -> origin/gh/angelayi/139/base 2025-12-04T09:43:32.1576864Z * [new branch] gh/angelayi/139/head -> origin/gh/angelayi/139/head 2025-12-04T09:43:32.1578639Z * [new branch] gh/angelayi/139/orig -> origin/gh/angelayi/139/orig 2025-12-04T09:43:32.1581125Z * [new branch] gh/angelayi/140/base -> origin/gh/angelayi/140/base 2025-12-04T09:43:32.1582963Z * [new branch] gh/angelayi/140/head -> origin/gh/angelayi/140/head 2025-12-04T09:43:32.1584789Z * [new branch] 
gh/angelayi/140/orig -> origin/gh/angelayi/140/orig 2025-12-04T09:43:32.1587935Z * [new branch] gh/angelayi/141/base -> origin/gh/angelayi/141/base 2025-12-04T09:43:32.1589632Z * [new branch] gh/angelayi/141/head -> origin/gh/angelayi/141/head 2025-12-04T09:43:32.1591366Z * [new branch] gh/angelayi/141/orig -> origin/gh/angelayi/141/orig 2025-12-04T09:43:32.1593848Z * [new branch] gh/angelayi/142/base -> origin/gh/angelayi/142/base 2025-12-04T09:43:32.1595702Z * [new branch] gh/angelayi/142/head -> origin/gh/angelayi/142/head 2025-12-04T09:43:32.1597449Z * [new branch] gh/angelayi/142/orig -> origin/gh/angelayi/142/orig 2025-12-04T09:43:32.1599884Z * [new branch] gh/angelayi/143/base -> origin/gh/angelayi/143/base 2025-12-04T09:43:32.1601649Z * [new branch] gh/angelayi/143/head -> origin/gh/angelayi/143/head 2025-12-04T09:43:32.1603421Z * [new branch] gh/angelayi/143/orig -> origin/gh/angelayi/143/orig 2025-12-04T09:43:32.1605883Z * [new branch] gh/angelayi/144/base -> origin/gh/angelayi/144/base 2025-12-04T09:43:32.1607819Z * [new branch] gh/angelayi/144/head -> origin/gh/angelayi/144/head 2025-12-04T09:43:32.1609606Z * [new branch] gh/angelayi/144/orig -> origin/gh/angelayi/144/orig 2025-12-04T09:43:32.1612548Z * [new branch] gh/anijain2305/753/base -> origin/gh/anijain2305/753/base 2025-12-04T09:43:32.1614292Z * [new branch] gh/anijain2305/753/head -> origin/gh/anijain2305/753/head 2025-12-04T09:43:32.1616056Z * [new branch] gh/anijain2305/753/orig -> origin/gh/anijain2305/753/orig 2025-12-04T09:43:32.1618521Z * [new branch] gh/anijain2305/810/base -> origin/gh/anijain2305/810/base 2025-12-04T09:43:32.1620337Z * [new branch] gh/anijain2305/810/head -> origin/gh/anijain2305/810/head 2025-12-04T09:43:32.1622101Z * [new branch] gh/anijain2305/810/orig -> origin/gh/anijain2305/810/orig 2025-12-04T09:43:32.1624519Z * [new branch] gh/anijain2305/854/base -> origin/gh/anijain2305/854/base 2025-12-04T09:43:32.1626321Z * [new branch] gh/anijain2305/854/head -> 
origin/gh/anijain2305/854/head 2025-12-04T09:43:32.1628205Z * [new branch] gh/anijain2305/854/orig -> origin/gh/anijain2305/854/orig 2025-12-04T09:43:32.1630767Z * [new branch] gh/anijain2305/864/base -> origin/gh/anijain2305/864/base 2025-12-04T09:43:32.1632523Z * [new branch] gh/anijain2305/864/head -> origin/gh/anijain2305/864/head 2025-12-04T09:43:32.1634279Z * [new branch] gh/anijain2305/864/orig -> origin/gh/anijain2305/864/orig 2025-12-04T09:43:32.1636665Z * [new branch] gh/anijain2305/870/base -> origin/gh/anijain2305/870/base 2025-12-04T09:43:32.1638408Z * [new branch] gh/anijain2305/870/head -> origin/gh/anijain2305/870/head 2025-12-04T09:43:32.1640131Z * [new branch] gh/anijain2305/870/orig -> origin/gh/anijain2305/870/orig 2025-12-04T09:43:32.1642521Z * [new branch] gh/anijain2305/873/base -> origin/gh/anijain2305/873/base 2025-12-04T09:43:32.1644215Z * [new branch] gh/anijain2305/873/head -> origin/gh/anijain2305/873/head 2025-12-04T09:43:32.1646043Z * [new branch] gh/anijain2305/873/orig -> origin/gh/anijain2305/873/orig 2025-12-04T09:43:32.1648414Z * [new branch] gh/anijain2305/894/base -> origin/gh/anijain2305/894/base 2025-12-04T09:43:32.1650175Z * [new branch] gh/anijain2305/894/head -> origin/gh/anijain2305/894/head 2025-12-04T09:43:32.1651965Z * [new branch] gh/anijain2305/894/orig -> origin/gh/anijain2305/894/orig 2025-12-04T09:43:32.1654378Z * [new branch] gh/anijain2305/895/base -> origin/gh/anijain2305/895/base 2025-12-04T09:43:32.1656367Z * [new branch] gh/anijain2305/895/head -> origin/gh/anijain2305/895/head 2025-12-04T09:43:32.1658241Z * [new branch] gh/anijain2305/895/orig -> origin/gh/anijain2305/895/orig 2025-12-04T09:43:32.1660601Z * [new branch] gh/anijain2305/910/base -> origin/gh/anijain2305/910/base 2025-12-04T09:43:32.1662372Z * [new branch] gh/anijain2305/910/head -> origin/gh/anijain2305/910/head 2025-12-04T09:43:32.1664097Z * [new branch] gh/anijain2305/910/orig -> origin/gh/anijain2305/910/orig 2025-12-04T09:43:32.1666476Z * 
[new branch] gh/anijain2305/919/base -> origin/gh/anijain2305/919/base 2025-12-04T09:43:32.1668353Z * [new branch] gh/anijain2305/919/head -> origin/gh/anijain2305/919/head 2025-12-04T09:43:32.1670051Z * [new branch] gh/anijain2305/919/orig -> origin/gh/anijain2305/919/orig 2025-12-04T09:43:32.1672456Z * [new branch] gh/anijain2305/922/base -> origin/gh/anijain2305/922/base 2025-12-04T09:43:32.1674230Z * [new branch] gh/anijain2305/922/head -> origin/gh/anijain2305/922/head 2025-12-04T09:43:32.1676040Z * [new branch] gh/anijain2305/922/orig -> origin/gh/anijain2305/922/orig 2025-12-04T09:43:32.1678434Z * [new branch] gh/anijain2305/932/base -> origin/gh/anijain2305/932/base 2025-12-04T09:43:32.1680320Z * [new branch] gh/anijain2305/932/head -> origin/gh/anijain2305/932/head 2025-12-04T09:43:32.1682150Z * [new branch] gh/anijain2305/932/orig -> origin/gh/anijain2305/932/orig 2025-12-04T09:43:32.1684524Z * [new branch] gh/anijain2305/940/base -> origin/gh/anijain2305/940/base 2025-12-04T09:43:32.1686283Z * [new branch] gh/anijain2305/940/head -> origin/gh/anijain2305/940/head 2025-12-04T09:43:32.1688075Z * [new branch] gh/anijain2305/940/orig -> origin/gh/anijain2305/940/orig 2025-12-04T09:43:32.1690481Z * [new branch] gh/anijain2305/941/base -> origin/gh/anijain2305/941/base 2025-12-04T09:43:32.1692216Z * [new branch] gh/anijain2305/941/head -> origin/gh/anijain2305/941/head 2025-12-04T09:43:32.1694006Z * [new branch] gh/anijain2305/941/orig -> origin/gh/anijain2305/941/orig 2025-12-04T09:43:32.1696386Z * [new branch] gh/anijain2305/942/base -> origin/gh/anijain2305/942/base 2025-12-04T09:43:32.1698185Z * [new branch] gh/anijain2305/942/head -> origin/gh/anijain2305/942/head 2025-12-04T09:43:32.1700033Z * [new branch] gh/anijain2305/942/orig -> origin/gh/anijain2305/942/orig 2025-12-04T09:43:32.1702450Z * [new branch] gh/anijain2305/943/base -> origin/gh/anijain2305/943/base 2025-12-04T09:43:32.1704284Z * [new branch] gh/anijain2305/943/head -> 
origin/gh/anijain2305/943/head 2025-12-04T09:43:32.1706012Z * [new branch] gh/anijain2305/943/orig -> origin/gh/anijain2305/943/orig 2025-12-04T09:43:32.1709250Z * [new branch] gh/anijain2305/944/base -> origin/gh/anijain2305/944/base 2025-12-04T09:43:32.1710979Z * [new branch] gh/anijain2305/944/head -> origin/gh/anijain2305/944/head 2025-12-04T09:43:32.1713188Z * [new branch] gh/anijain2305/944/orig -> origin/gh/anijain2305/944/orig 2025-12-04T09:43:32.1715630Z * [new branch] gh/anijain2305/945/base -> origin/gh/anijain2305/945/base 2025-12-04T09:43:32.1717432Z * [new branch] gh/anijain2305/945/head -> origin/gh/anijain2305/945/head 2025-12-04T09:43:32.1719196Z * [new branch] gh/anijain2305/945/orig -> origin/gh/anijain2305/945/orig 2025-12-04T09:43:32.1721669Z * [new branch] gh/anijain2305/946/base -> origin/gh/anijain2305/946/base 2025-12-04T09:43:32.1723415Z * [new branch] gh/anijain2305/946/head -> origin/gh/anijain2305/946/head 2025-12-04T09:43:32.1725233Z * [new branch] gh/anijain2305/946/orig -> origin/gh/anijain2305/946/orig 2025-12-04T09:43:32.1727709Z * [new branch] gh/anijain2305/947/base -> origin/gh/anijain2305/947/base 2025-12-04T09:43:32.1729420Z * [new branch] gh/anijain2305/947/head -> origin/gh/anijain2305/947/head 2025-12-04T09:43:32.1731148Z * [new branch] gh/anijain2305/947/orig -> origin/gh/anijain2305/947/orig 2025-12-04T09:43:32.1733640Z * [new branch] gh/anijain2305/948/base -> origin/gh/anijain2305/948/base 2025-12-04T09:43:32.1735375Z * [new branch] gh/anijain2305/948/head -> origin/gh/anijain2305/948/head 2025-12-04T09:43:32.1737090Z * [new branch] gh/anijain2305/948/orig -> origin/gh/anijain2305/948/orig 2025-12-04T09:43:32.1739546Z * [new branch] gh/anijain2305/949/base -> origin/gh/anijain2305/949/base 2025-12-04T09:43:32.1741479Z * [new branch] gh/anijain2305/949/head -> origin/gh/anijain2305/949/head 2025-12-04T09:43:32.1743229Z * [new branch] gh/anijain2305/949/orig -> origin/gh/anijain2305/949/orig 2025-12-04T09:43:32.1745734Z * 
[new branch] gh/anijain2305/950/base -> origin/gh/anijain2305/950/base 2025-12-04T09:43:32.1747626Z * [new branch] gh/anijain2305/950/head -> origin/gh/anijain2305/950/head 2025-12-04T09:43:32.1749382Z * [new branch] gh/anijain2305/950/orig -> origin/gh/anijain2305/950/orig 2025-12-04T09:43:32.1751865Z * [new branch] gh/anijain2305/951/base -> origin/gh/anijain2305/951/base 2025-12-04T09:43:32.1753589Z * [new branch] gh/anijain2305/951/head -> origin/gh/anijain2305/951/head 2025-12-04T09:43:32.1755672Z * [new branch] gh/anijain2305/951/orig -> origin/gh/anijain2305/951/orig 2025-12-04T09:43:32.1759399Z * [new branch] gh/anijain2305/952/base -> origin/gh/anijain2305/952/base 2025-12-04T09:43:32.1761685Z * [new branch] gh/anijain2305/952/head -> origin/gh/anijain2305/952/head 2025-12-04T09:43:32.1763912Z * [new branch] gh/anijain2305/952/orig -> origin/gh/anijain2305/952/orig 2025-12-04T09:43:32.1766937Z * [new branch] gh/anijain2305/953/base -> origin/gh/anijain2305/953/base 2025-12-04T09:43:32.1769099Z * [new branch] gh/anijain2305/953/head -> origin/gh/anijain2305/953/head 2025-12-04T09:43:32.1771211Z * [new branch] gh/anijain2305/953/orig -> origin/gh/anijain2305/953/orig 2025-12-04T09:43:32.1773911Z * [new branch] gh/anijain2305/954/base -> origin/gh/anijain2305/954/base 2025-12-04T09:43:32.1775557Z * [new branch] gh/anijain2305/954/head -> origin/gh/anijain2305/954/head 2025-12-04T09:43:32.1777351Z * [new branch] gh/anijain2305/954/orig -> origin/gh/anijain2305/954/orig 2025-12-04T09:43:32.1779823Z * [new branch] gh/anijain2305/955/base -> origin/gh/anijain2305/955/base 2025-12-04T09:43:32.1781524Z * [new branch] gh/anijain2305/955/head -> origin/gh/anijain2305/955/head 2025-12-04T09:43:32.1783263Z * [new branch] gh/anijain2305/955/orig -> origin/gh/anijain2305/955/orig 2025-12-04T09:43:32.1785733Z * [new branch] gh/anijain2305/956/base -> origin/gh/anijain2305/956/base 2025-12-04T09:43:32.1787542Z * [new branch] gh/anijain2305/956/head -> 
origin/gh/anijain2305/956/head 2025-12-04T09:43:32.1789431Z * [new branch] gh/anijain2305/956/orig -> origin/gh/anijain2305/956/orig 2025-12-04T09:43:32.1791962Z * [new branch] gh/anijain2305/957/base -> origin/gh/anijain2305/957/base 2025-12-04T09:43:32.1793738Z * [new branch] gh/anijain2305/957/head -> origin/gh/anijain2305/957/head 2025-12-04T09:43:32.1795527Z * [new branch] gh/anijain2305/957/orig -> origin/gh/anijain2305/957/orig 2025-12-04T09:43:32.1798003Z * [new branch] gh/anijain2305/958/base -> origin/gh/anijain2305/958/base 2025-12-04T09:43:32.1800075Z * [new branch] gh/anijain2305/958/head -> origin/gh/anijain2305/958/head 2025-12-04T09:43:32.1801510Z * [new branch] gh/anijain2305/958/orig -> origin/gh/anijain2305/958/orig 2025-12-04T09:43:32.1804020Z * [new branch] gh/anijain2305/959/base -> origin/gh/anijain2305/959/base 2025-12-04T09:43:32.1805756Z * [new branch] gh/anijain2305/959/head -> origin/gh/anijain2305/959/head 2025-12-04T09:43:32.1807509Z * [new branch] gh/anijain2305/959/orig -> origin/gh/anijain2305/959/orig 2025-12-04T09:43:32.1810029Z * [new branch] gh/anijain2305/960/base -> origin/gh/anijain2305/960/base 2025-12-04T09:43:32.1811774Z * [new branch] gh/anijain2305/960/head -> origin/gh/anijain2305/960/head 2025-12-04T09:43:32.1813504Z * [new branch] gh/anijain2305/960/orig -> origin/gh/anijain2305/960/orig 2025-12-04T09:43:32.1815997Z * [new branch] gh/anijain2305/961/base -> origin/gh/anijain2305/961/base 2025-12-04T09:43:32.1817783Z * [new branch] gh/anijain2305/961/head -> origin/gh/anijain2305/961/head 2025-12-04T09:43:32.1819501Z * [new branch] gh/anijain2305/961/orig -> origin/gh/anijain2305/961/orig 2025-12-04T09:43:32.1821907Z * [new branch] gh/anijain2305/962/base -> origin/gh/anijain2305/962/base 2025-12-04T09:43:32.1823639Z * [new branch] gh/anijain2305/962/head -> origin/gh/anijain2305/962/head 2025-12-04T09:43:32.1825403Z * [new branch] gh/anijain2305/962/orig -> origin/gh/anijain2305/962/orig 2025-12-04T09:43:32.1828264Z * 
[new branch] gh/anijain2305/963/base -> origin/gh/anijain2305/963/base 2025-12-04T09:43:32.1830179Z * [new branch] gh/anijain2305/963/head -> origin/gh/anijain2305/963/head 2025-12-04T09:43:32.1832089Z * [new branch] gh/anijain2305/963/orig -> origin/gh/anijain2305/963/orig 2025-12-04T09:43:32.1834481Z * [new branch] gh/anijain2305/964/base -> origin/gh/anijain2305/964/base 2025-12-04T09:43:32.1836227Z * [new branch] gh/anijain2305/964/head -> origin/gh/anijain2305/964/head 2025-12-04T09:43:32.1837958Z * [new branch] gh/anijain2305/964/orig -> origin/gh/anijain2305/964/orig 2025-12-04T09:43:32.1840412Z * [new branch] gh/anijain2305/965/base -> origin/gh/anijain2305/965/base 2025-12-04T09:43:32.1842138Z * [new branch] gh/anijain2305/965/head -> origin/gh/anijain2305/965/head 2025-12-04T09:43:32.1843940Z * [new branch] gh/anijain2305/965/orig -> origin/gh/anijain2305/965/orig 2025-12-04T09:43:32.1846219Z * [new branch] gh/anijain2305/966/base -> origin/gh/anijain2305/966/base 2025-12-04T09:43:32.1847997Z * [new branch] gh/anijain2305/966/head -> origin/gh/anijain2305/966/head 2025-12-04T09:43:32.1849856Z * [new branch] gh/anijain2305/966/orig -> origin/gh/anijain2305/966/orig 2025-12-04T09:43:32.1852292Z * [new branch] gh/anijain2305/967/base -> origin/gh/anijain2305/967/base 2025-12-04T09:43:32.1854015Z * [new branch] gh/anijain2305/967/head -> origin/gh/anijain2305/967/head 2025-12-04T09:43:32.1856119Z * [new branch] gh/anijain2305/967/orig -> origin/gh/anijain2305/967/orig 2025-12-04T09:43:32.1858505Z * [new branch] gh/anijain2305/968/base -> origin/gh/anijain2305/968/base 2025-12-04T09:43:32.1860290Z * [new branch] gh/anijain2305/968/head -> origin/gh/anijain2305/968/head 2025-12-04T09:43:32.1862025Z * [new branch] gh/anijain2305/968/orig -> origin/gh/anijain2305/968/orig 2025-12-04T09:43:32.1864512Z * [new branch] gh/anijain2305/969/base -> origin/gh/anijain2305/969/base 2025-12-04T09:43:32.1866336Z * [new branch] gh/anijain2305/969/head -> 
origin/gh/anijain2305/969/head 2025-12-04T09:43:32.1868704Z * [new branch] gh/anijain2305/969/orig -> origin/gh/anijain2305/969/orig 2025-12-04T09:43:32.1870731Z * [new branch] gh/anijain2305/970/base -> origin/gh/anijain2305/970/base 2025-12-04T09:43:32.1872478Z * [new branch] gh/anijain2305/970/head -> origin/gh/anijain2305/970/head 2025-12-04T09:43:32.1874340Z * [new branch] gh/anijain2305/970/orig -> origin/gh/anijain2305/970/orig 2025-12-04T09:43:32.1877198Z * [new branch] gh/anjali411/216/base -> origin/gh/anjali411/216/base 2025-12-04T09:43:32.1878927Z * [new branch] gh/anjali411/216/head -> origin/gh/anjali411/216/head 2025-12-04T09:43:32.1880676Z * [new branch] gh/anjali411/216/orig -> origin/gh/anjali411/216/orig 2025-12-04T09:43:32.1883588Z * [new branch] gh/anshul-si/1/base -> origin/gh/anshul-si/1/base 2025-12-04T09:43:32.1885309Z * [new branch] gh/anshul-si/1/head -> origin/gh/anshul-si/1/head 2025-12-04T09:43:32.1887520Z * [new branch] gh/anshul-si/2/base -> origin/gh/anshul-si/2/base 2025-12-04T09:43:32.1889190Z * [new branch] gh/anshul-si/2/head -> origin/gh/anshul-si/2/head 2025-12-04T09:43:32.1891454Z * [new branch] gh/anshul-si/3/base -> origin/gh/anshul-si/3/base 2025-12-04T09:43:32.1893099Z * [new branch] gh/anshul-si/3/head -> origin/gh/anshul-si/3/head 2025-12-04T09:43:32.1895341Z * [new branch] gh/anshul-si/4/base -> origin/gh/anshul-si/4/base 2025-12-04T09:43:32.1897044Z * [new branch] gh/anshul-si/4/head -> origin/gh/anshul-si/4/head 2025-12-04T09:43:32.1899232Z * [new branch] gh/anshul-si/5/base -> origin/gh/anshul-si/5/base 2025-12-04T09:43:32.1900900Z * [new branch] gh/anshul-si/5/head -> origin/gh/anshul-si/5/head 2025-12-04T09:43:32.1903847Z * [new branch] gh/anshul-si/53/base -> origin/gh/anshul-si/53/base 2025-12-04T09:43:32.1905653Z * [new branch] gh/anshul-si/53/head -> origin/gh/anshul-si/53/head 2025-12-04T09:43:32.1908196Z * [new branch] gh/anshul-si/58/base -> origin/gh/anshul-si/58/base 2025-12-04T09:43:32.1909922Z * [new 
branch] gh/anshul-si/58/head -> origin/gh/anshul-si/58/head 2025-12-04T09:43:32.1912205Z * [new branch] gh/anshul-si/66/base -> origin/gh/anshul-si/66/base 2025-12-04T09:43:32.1913964Z * [new branch] gh/anshul-si/66/head -> origin/gh/anshul-si/66/head 2025-12-04T09:43:32.1915657Z * [new branch] gh/anshul-si/66/orig -> origin/gh/anshul-si/66/orig 2025-12-04T09:43:32.1918012Z * [new branch] gh/anshul-si/67/base -> origin/gh/anshul-si/67/base 2025-12-04T09:43:32.1919722Z * [new branch] gh/anshul-si/67/head -> origin/gh/anshul-si/67/head 2025-12-04T09:43:32.1921384Z * [new branch] gh/anshul-si/67/orig -> origin/gh/anshul-si/67/orig 2025-12-04T09:43:32.1923804Z * [new branch] gh/anshul-si/68/base -> origin/gh/anshul-si/68/base 2025-12-04T09:43:32.1925515Z * [new branch] gh/anshul-si/68/head -> origin/gh/anshul-si/68/head 2025-12-04T09:43:32.1927208Z * [new branch] gh/anshul-si/68/orig -> origin/gh/anshul-si/68/orig 2025-12-04T09:43:32.1929695Z * [new branch] gh/anshul-si/69/base -> origin/gh/anshul-si/69/base 2025-12-04T09:43:32.1931388Z * [new branch] gh/anshul-si/69/head -> origin/gh/anshul-si/69/head 2025-12-04T09:43:32.1933117Z * [new branch] gh/anshul-si/69/orig -> origin/gh/anshul-si/69/orig 2025-12-04T09:43:32.1935459Z * [new branch] gh/anshul-si/70/base -> origin/gh/anshul-si/70/base 2025-12-04T09:43:32.1937235Z * [new branch] gh/anshul-si/70/head -> origin/gh/anshul-si/70/head 2025-12-04T09:43:32.1939095Z * [new branch] gh/anshul-si/70/orig -> origin/gh/anshul-si/70/orig 2025-12-04T09:43:32.1941474Z * [new branch] gh/anshul-si/71/base -> origin/gh/anshul-si/71/base 2025-12-04T09:43:32.1943239Z * [new branch] gh/anshul-si/71/head -> origin/gh/anshul-si/71/head 2025-12-04T09:43:32.1944992Z * [new branch] gh/anshul-si/71/orig -> origin/gh/anshul-si/71/orig 2025-12-04T09:43:32.1947383Z * [new branch] gh/anshul-si/72/base -> origin/gh/anshul-si/72/base 2025-12-04T09:43:32.1949205Z * [new branch] gh/anshul-si/72/head -> origin/gh/anshul-si/72/head 
2025-12-04T09:43:32.1950937Z * [new branch] gh/anshul-si/72/orig -> origin/gh/anshul-si/72/orig 2025-12-04T09:43:32.1953219Z * [new branch] gh/anshul-si/73/base -> origin/gh/anshul-si/73/base 2025-12-04T09:43:32.1954944Z * [new branch] gh/anshul-si/73/head -> origin/gh/anshul-si/73/head 2025-12-04T09:43:32.1956919Z * [new branch] gh/anshul-si/73/orig -> origin/gh/anshul-si/73/orig 2025-12-04T09:43:32.1959846Z * [new branch] gh/aorenste/132/base -> origin/gh/aorenste/132/base 2025-12-04T09:43:32.1961567Z * [new branch] gh/aorenste/132/head -> origin/gh/aorenste/132/head 2025-12-04T09:43:32.1964129Z * [new branch] gh/aorenste/134/base -> origin/gh/aorenste/134/base 2025-12-04T09:43:32.1965956Z * [new branch] gh/aorenste/134/head -> origin/gh/aorenste/134/head 2025-12-04T09:43:32.1967752Z * [new branch] gh/aorenste/134/orig -> origin/gh/aorenste/134/orig 2025-12-04T09:43:32.1970427Z * [new branch] gh/aorenste/139/base -> origin/gh/aorenste/139/base 2025-12-04T09:43:32.1972055Z * [new branch] gh/aorenste/139/head -> origin/gh/aorenste/139/head 2025-12-04T09:43:32.1973861Z * [new branch] gh/aorenste/139/orig -> origin/gh/aorenste/139/orig 2025-12-04T09:43:32.1976146Z * [new branch] gh/aorenste/141/base -> origin/gh/aorenste/141/base 2025-12-04T09:43:32.1977836Z * [new branch] gh/aorenste/141/head -> origin/gh/aorenste/141/head 2025-12-04T09:43:32.1980480Z * [new branch] gh/aorenste/145/base -> origin/gh/aorenste/145/base 2025-12-04T09:43:32.1982232Z * [new branch] gh/aorenste/145/head -> origin/gh/aorenste/145/head 2025-12-04T09:43:32.1984079Z * [new branch] gh/aorenste/145/orig -> origin/gh/aorenste/145/orig 2025-12-04T09:43:32.1986664Z * [new branch] gh/aorenste/146/base -> origin/gh/aorenste/146/base 2025-12-04T09:43:32.1988597Z * [new branch] gh/aorenste/146/head -> origin/gh/aorenste/146/head 2025-12-04T09:43:32.1990278Z * [new branch] gh/aorenste/146/orig -> origin/gh/aorenste/146/orig 2025-12-04T09:43:32.1992781Z * [new branch] gh/aorenste/147/base -> 
origin/gh/aorenste/147/base 2025-12-04T09:43:32.1994743Z * [new branch] gh/aorenste/147/head -> origin/gh/aorenste/147/head 2025-12-04T09:43:32.1996493Z * [new branch] gh/aorenste/147/orig -> origin/gh/aorenste/147/orig 2025-12-04T09:43:32.1998836Z * [new branch] gh/aorenste/148/base -> origin/gh/aorenste/148/base 2025-12-04T09:43:32.2000582Z * [new branch] gh/aorenste/148/head -> origin/gh/aorenste/148/head 2025-12-04T09:43:32.2002352Z * [new branch] gh/aorenste/148/orig -> origin/gh/aorenste/148/orig 2025-12-04T09:43:32.2004692Z * [new branch] gh/aorenste/149/base -> origin/gh/aorenste/149/base 2025-12-04T09:43:32.2006468Z * [new branch] gh/aorenste/149/head -> origin/gh/aorenste/149/head 2025-12-04T09:43:32.2008141Z * [new branch] gh/aorenste/149/orig -> origin/gh/aorenste/149/orig 2025-12-04T09:43:32.2010794Z * [new branch] gh/aorenste/150/base -> origin/gh/aorenste/150/base 2025-12-04T09:43:32.2012331Z * [new branch] gh/aorenste/150/head -> origin/gh/aorenste/150/head 2025-12-04T09:43:32.2014196Z * [new branch] gh/aorenste/150/orig -> origin/gh/aorenste/150/orig 2025-12-04T09:43:32.2016487Z * [new branch] gh/aorenste/151/base -> origin/gh/aorenste/151/base 2025-12-04T09:43:32.2018266Z * [new branch] gh/aorenste/151/head -> origin/gh/aorenste/151/head 2025-12-04T09:43:32.2019797Z * [new branch] gh/aorenste/151/orig -> origin/gh/aorenste/151/orig 2025-12-04T09:43:32.2022279Z * [new branch] gh/aorenste/152/base -> origin/gh/aorenste/152/base 2025-12-04T09:43:32.2024015Z * [new branch] gh/aorenste/152/head -> origin/gh/aorenste/152/head 2025-12-04T09:43:32.2025686Z * [new branch] gh/aorenste/152/orig -> origin/gh/aorenste/152/orig 2025-12-04T09:43:32.2027984Z * [new branch] gh/aorenste/153/base -> origin/gh/aorenste/153/base 2025-12-04T09:43:32.2029801Z * [new branch] gh/aorenste/153/head -> origin/gh/aorenste/153/head 2025-12-04T09:43:32.2031665Z * [new branch] gh/aorenste/153/orig -> origin/gh/aorenste/153/orig 2025-12-04T09:43:32.2033813Z * [new branch] 
gh/aorenste/154/base -> origin/gh/aorenste/154/base 2025-12-04T09:43:32.2035528Z * [new branch] gh/aorenste/154/head -> origin/gh/aorenste/154/head 2025-12-04T09:43:32.2037353Z * [new branch] gh/aorenste/154/orig -> origin/gh/aorenste/154/orig 2025-12-04T09:43:32.2039422Z * [new branch] gh/aorenste/155/base -> origin/gh/aorenste/155/base 2025-12-04T09:43:32.2041205Z * [new branch] gh/aorenste/155/head -> origin/gh/aorenste/155/head 2025-12-04T09:43:32.2042916Z * [new branch] gh/aorenste/155/orig -> origin/gh/aorenste/155/orig 2025-12-04T09:43:32.2045108Z * [new branch] gh/aorenste/156/base -> origin/gh/aorenste/156/base 2025-12-04T09:43:32.2046824Z * [new branch] gh/aorenste/156/head -> origin/gh/aorenste/156/head 2025-12-04T09:43:32.2048506Z * [new branch] gh/aorenste/156/orig -> origin/gh/aorenste/156/orig 2025-12-04T09:43:32.2051131Z * [new branch] gh/aorenste/157/base -> origin/gh/aorenste/157/base 2025-12-04T09:43:32.2052878Z * [new branch] gh/aorenste/157/head -> origin/gh/aorenste/157/head 2025-12-04T09:43:32.2054630Z * [new branch] gh/aorenste/157/orig -> origin/gh/aorenste/157/orig 2025-12-04T09:43:32.2057337Z * [new branch] gh/aorenste/158/base -> origin/gh/aorenste/158/base 2025-12-04T09:43:32.2059027Z * [new branch] gh/aorenste/158/head -> origin/gh/aorenste/158/head 2025-12-04T09:43:32.2060762Z * [new branch] gh/aorenste/158/orig -> origin/gh/aorenste/158/orig 2025-12-04T09:43:32.2063042Z * [new branch] gh/aorenste/159/base -> origin/gh/aorenste/159/base 2025-12-04T09:43:32.2064798Z * [new branch] gh/aorenste/159/head -> origin/gh/aorenste/159/head 2025-12-04T09:43:32.2066515Z * [new branch] gh/aorenste/159/orig -> origin/gh/aorenste/159/orig 2025-12-04T09:43:32.2069476Z * [new branch] gh/avikchaudhuri/1/base -> origin/gh/avikchaudhuri/1/base 2025-12-04T09:43:32.2071283Z * [new branch] gh/avikchaudhuri/1/head -> origin/gh/avikchaudhuri/1/head 2025-12-04T09:43:32.2073462Z * [new branch] gh/avikchaudhuri/2/base -> origin/gh/avikchaudhuri/2/base 
2025-12-04T09:43:32.2075240Z * [new branch] gh/avikchaudhuri/2/head -> origin/gh/avikchaudhuri/2/head 2025-12-04T09:43:32.2076834Z * [new branch] gh/avikchaudhuri/2/orig -> origin/gh/avikchaudhuri/2/orig 2025-12-04T09:43:32.2080122Z * [new branch] gh/bdhirsh/666/base -> origin/gh/bdhirsh/666/base 2025-12-04T09:43:32.2081794Z * [new branch] gh/bdhirsh/666/head -> origin/gh/bdhirsh/666/head 2025-12-04T09:43:32.2083536Z * [new branch] gh/bdhirsh/666/orig -> origin/gh/bdhirsh/666/orig 2025-12-04T09:43:32.2085845Z * [new branch] gh/bdhirsh/668/base -> origin/gh/bdhirsh/668/base 2025-12-04T09:43:32.2087613Z * [new branch] gh/bdhirsh/668/head -> origin/gh/bdhirsh/668/head 2025-12-04T09:43:32.2089317Z * [new branch] gh/bdhirsh/668/orig -> origin/gh/bdhirsh/668/orig 2025-12-04T09:43:32.2091838Z * [new branch] gh/bdhirsh/669/base -> origin/gh/bdhirsh/669/base 2025-12-04T09:43:32.2093565Z * [new branch] gh/bdhirsh/669/head -> origin/gh/bdhirsh/669/head 2025-12-04T09:43:32.2095185Z * [new branch] gh/bdhirsh/669/orig -> origin/gh/bdhirsh/669/orig 2025-12-04T09:43:32.2097693Z * [new branch] gh/bdhirsh/670/base -> origin/gh/bdhirsh/670/base 2025-12-04T09:43:32.2099527Z * [new branch] gh/bdhirsh/670/head -> origin/gh/bdhirsh/670/head 2025-12-04T09:43:32.2101280Z * [new branch] gh/bdhirsh/670/orig -> origin/gh/bdhirsh/670/orig 2025-12-04T09:43:32.2104236Z * [new branch] gh/bdhirsh/672/base -> origin/gh/bdhirsh/672/base 2025-12-04T09:43:32.2106025Z * [new branch] gh/bdhirsh/672/head -> origin/gh/bdhirsh/672/head 2025-12-04T09:43:32.2107839Z * [new branch] gh/bdhirsh/672/orig -> origin/gh/bdhirsh/672/orig 2025-12-04T09:43:32.2110415Z * [new branch] gh/bdhirsh/675/base -> origin/gh/bdhirsh/675/base 2025-12-04T09:43:32.2112300Z * [new branch] gh/bdhirsh/675/head -> origin/gh/bdhirsh/675/head 2025-12-04T09:43:32.2114010Z * [new branch] gh/bdhirsh/675/orig -> origin/gh/bdhirsh/675/orig 2025-12-04T09:43:32.2116627Z * [new branch] gh/bdhirsh/676/base -> origin/gh/bdhirsh/676/base 
2025-12-04T09:43:32.2118622Z * [new branch] gh/bdhirsh/676/head -> origin/gh/bdhirsh/676/head 2025-12-04T09:43:32.2120373Z * [new branch] gh/bdhirsh/676/orig -> origin/gh/bdhirsh/676/orig 2025-12-04T09:43:32.2122752Z * [new branch] gh/bdhirsh/677/base -> origin/gh/bdhirsh/677/base 2025-12-04T09:43:32.2125044Z * [new branch] gh/bdhirsh/677/head -> origin/gh/bdhirsh/677/head 2025-12-04T09:43:32.2126776Z * [new branch] gh/bdhirsh/677/orig -> origin/gh/bdhirsh/677/orig 2025-12-04T09:43:32.2129330Z * [new branch] gh/bdhirsh/678/base -> origin/gh/bdhirsh/678/base 2025-12-04T09:43:32.2131200Z * [new branch] gh/bdhirsh/678/head -> origin/gh/bdhirsh/678/head 2025-12-04T09:43:32.2132987Z * [new branch] gh/bdhirsh/678/orig -> origin/gh/bdhirsh/678/orig 2025-12-04T09:43:32.2135357Z * [new branch] gh/bdhirsh/679/base -> origin/gh/bdhirsh/679/base 2025-12-04T09:43:32.2137169Z * [new branch] gh/bdhirsh/679/head -> origin/gh/bdhirsh/679/head 2025-12-04T09:43:32.2138904Z * [new branch] gh/bdhirsh/679/orig -> origin/gh/bdhirsh/679/orig 2025-12-04T09:43:32.2141599Z * [new branch] gh/bdhirsh/680/base -> origin/gh/bdhirsh/680/base 2025-12-04T09:43:32.2143384Z * [new branch] gh/bdhirsh/680/head -> origin/gh/bdhirsh/680/head 2025-12-04T09:43:32.2145142Z * [new branch] gh/bdhirsh/680/orig -> origin/gh/bdhirsh/680/orig 2025-12-04T09:43:32.2147388Z * [new branch] gh/bdhirsh/681/base -> origin/gh/bdhirsh/681/base 2025-12-04T09:43:32.2149295Z * [new branch] gh/bdhirsh/681/head -> origin/gh/bdhirsh/681/head 2025-12-04T09:43:32.2151092Z * [new branch] gh/bdhirsh/681/orig -> origin/gh/bdhirsh/681/orig 2025-12-04T09:43:32.2153962Z * [new branch] gh/benjaminglass1/101/base -> origin/gh/benjaminglass1/101/base 2025-12-04T09:43:32.2155946Z * [new branch] gh/benjaminglass1/101/head -> origin/gh/benjaminglass1/101/head 2025-12-04T09:43:32.2157751Z * [new branch] gh/benjaminglass1/101/orig -> origin/gh/benjaminglass1/101/orig 2025-12-04T09:43:32.2160093Z * [new branch] gh/benjaminglass1/102/base -> 
origin/gh/benjaminglass1/102/base 2025-12-04T09:43:32.2161842Z * [new branch] gh/benjaminglass1/102/head -> origin/gh/benjaminglass1/102/head 2025-12-04T09:43:32.2163547Z * [new branch] gh/benjaminglass1/102/orig -> origin/gh/benjaminglass1/102/orig 2025-12-04T09:43:32.2165884Z * [new branch] gh/benjaminglass1/106/base -> origin/gh/benjaminglass1/106/base 2025-12-04T09:43:32.2167584Z * [new branch] gh/benjaminglass1/106/head -> origin/gh/benjaminglass1/106/head 2025-12-04T09:43:32.2169306Z * [new branch] gh/benjaminglass1/106/orig -> origin/gh/benjaminglass1/106/orig 2025-12-04T09:43:32.2171957Z * [new branch] gh/benjaminglass1/107/base -> origin/gh/benjaminglass1/107/base 2025-12-04T09:43:32.2173668Z * [new branch] gh/benjaminglass1/107/head -> origin/gh/benjaminglass1/107/head 2025-12-04T09:43:32.2175665Z * [new branch] gh/benjaminglass1/107/orig -> origin/gh/benjaminglass1/107/orig 2025-12-04T09:43:32.2178018Z * [new branch] gh/benjaminglass1/108/base -> origin/gh/benjaminglass1/108/base 2025-12-04T09:43:32.2179752Z * [new branch] gh/benjaminglass1/108/head -> origin/gh/benjaminglass1/108/head 2025-12-04T09:43:32.2181444Z * [new branch] gh/benjaminglass1/108/orig -> origin/gh/benjaminglass1/108/orig 2025-12-04T09:43:32.2183808Z * [new branch] gh/benjaminglass1/109/base -> origin/gh/benjaminglass1/109/base 2025-12-04T09:43:32.2185473Z * [new branch] gh/benjaminglass1/109/head -> origin/gh/benjaminglass1/109/head 2025-12-04T09:43:32.2187293Z * [new branch] gh/benjaminglass1/109/orig -> origin/gh/benjaminglass1/109/orig 2025-12-04T09:43:32.2189690Z * [new branch] gh/benjaminglass1/97/base -> origin/gh/benjaminglass1/97/base 2025-12-04T09:43:32.2191363Z * [new branch] gh/benjaminglass1/97/head -> origin/gh/benjaminglass1/97/head 2025-12-04T09:43:32.2193111Z * [new branch] gh/benjaminglass1/97/orig -> origin/gh/benjaminglass1/97/orig 2025-12-04T09:43:32.2195853Z * [new branch] gh/bobrenjc93/570/base -> origin/gh/bobrenjc93/570/base 2025-12-04T09:43:32.2197655Z * [new 
branch] gh/bobrenjc93/570/head -> origin/gh/bobrenjc93/570/head 2025-12-04T09:43:32.2199416Z * [new branch] gh/bobrenjc93/570/orig -> origin/gh/bobrenjc93/570/orig 2025-12-04T09:43:32.2201689Z * [new branch] gh/bobrenjc93/604/base -> origin/gh/bobrenjc93/604/base 2025-12-04T09:43:32.2203457Z * [new branch] gh/bobrenjc93/604/head -> origin/gh/bobrenjc93/604/head 2025-12-04T09:43:32.2205169Z * [new branch] gh/bobrenjc93/604/orig -> origin/gh/bobrenjc93/604/orig 2025-12-04T09:43:32.2207487Z * [new branch] gh/bobrenjc93/638/base -> origin/gh/bobrenjc93/638/base 2025-12-04T09:43:32.2209190Z * [new branch] gh/bobrenjc93/638/head -> origin/gh/bobrenjc93/638/head 2025-12-04T09:43:32.2210895Z * [new branch] gh/bobrenjc93/638/orig -> origin/gh/bobrenjc93/638/orig 2025-12-04T09:43:32.2213199Z * [new branch] gh/bobrenjc93/653/base -> origin/gh/bobrenjc93/653/base 2025-12-04T09:43:32.2214996Z * [new branch] gh/bobrenjc93/653/head -> origin/gh/bobrenjc93/653/head 2025-12-04T09:43:32.2216742Z * [new branch] gh/bobrenjc93/653/orig -> origin/gh/bobrenjc93/653/orig 2025-12-04T09:43:32.2219715Z * [new branch] gh/bobrenjc93/654/base -> origin/gh/bobrenjc93/654/base 2025-12-04T09:43:32.2221373Z * [new branch] gh/bobrenjc93/654/head -> origin/gh/bobrenjc93/654/head 2025-12-04T09:43:32.2223022Z * [new branch] gh/bobrenjc93/654/orig -> origin/gh/bobrenjc93/654/orig 2025-12-04T09:43:32.2225544Z * [new branch] gh/bobrenjc93/657/base -> origin/gh/bobrenjc93/657/base 2025-12-04T09:43:32.2227390Z * [new branch] gh/bobrenjc93/657/head -> origin/gh/bobrenjc93/657/head 2025-12-04T09:43:32.2228931Z * [new branch] gh/bobrenjc93/657/orig -> origin/gh/bobrenjc93/657/orig 2025-12-04T09:43:32.2231272Z * [new branch] gh/bobrenjc93/672/base -> origin/gh/bobrenjc93/672/base 2025-12-04T09:43:32.2232957Z * [new branch] gh/bobrenjc93/672/head -> origin/gh/bobrenjc93/672/head 2025-12-04T09:43:32.2234688Z * [new branch] gh/bobrenjc93/672/orig -> origin/gh/bobrenjc93/672/orig 2025-12-04T09:43:32.2237020Z * [new 
branch] gh/bobrenjc93/679/base -> origin/gh/bobrenjc93/679/base 2025-12-04T09:43:32.2238893Z * [new branch] gh/bobrenjc93/679/head -> origin/gh/bobrenjc93/679/head 2025-12-04T09:43:32.2240629Z * [new branch] gh/bobrenjc93/679/orig -> origin/gh/bobrenjc93/679/orig 2025-12-04T09:43:32.2242922Z * [new branch] gh/bobrenjc93/680/base -> origin/gh/bobrenjc93/680/base 2025-12-04T09:43:32.2244714Z * [new branch] gh/bobrenjc93/680/head -> origin/gh/bobrenjc93/680/head 2025-12-04T09:43:32.2246474Z * [new branch] gh/bobrenjc93/680/orig -> origin/gh/bobrenjc93/680/orig 2025-12-04T09:43:32.2248639Z * [new branch] gh/bobrenjc93/681/base -> origin/gh/bobrenjc93/681/base 2025-12-04T09:43:32.2250387Z * [new branch] gh/bobrenjc93/681/head -> origin/gh/bobrenjc93/681/head 2025-12-04T09:43:32.2252178Z * [new branch] gh/bobrenjc93/681/orig -> origin/gh/bobrenjc93/681/orig 2025-12-04T09:43:32.2254306Z * [new branch] gh/bobrenjc93/682/base -> origin/gh/bobrenjc93/682/base 2025-12-04T09:43:32.2256712Z * [new branch] gh/bobrenjc93/682/head -> origin/gh/bobrenjc93/682/head 2025-12-04T09:43:32.2257829Z * [new branch] gh/bobrenjc93/682/orig -> origin/gh/bobrenjc93/682/orig 2025-12-04T09:43:32.2260258Z * [new branch] gh/bobrenjc93/683/base -> origin/gh/bobrenjc93/683/base 2025-12-04T09:43:32.2262088Z * [new branch] gh/bobrenjc93/683/head -> origin/gh/bobrenjc93/683/head 2025-12-04T09:43:32.2263779Z * [new branch] gh/bobrenjc93/683/orig -> origin/gh/bobrenjc93/683/orig 2025-12-04T09:43:32.2266127Z * [new branch] gh/bobrenjc93/684/base -> origin/gh/bobrenjc93/684/base 2025-12-04T09:43:32.2268154Z * [new branch] gh/bobrenjc93/684/head -> origin/gh/bobrenjc93/684/head 2025-12-04T09:43:32.2269944Z * [new branch] gh/bobrenjc93/684/orig -> origin/gh/bobrenjc93/684/orig 2025-12-04T09:43:32.2272110Z * [new branch] gh/bobrenjc93/685/base -> origin/gh/bobrenjc93/685/base 2025-12-04T09:43:32.2274075Z * [new branch] gh/bobrenjc93/685/head -> origin/gh/bobrenjc93/685/head 2025-12-04T09:43:32.2276064Z * [new 
branch] gh/bobrenjc93/685/orig -> origin/gh/bobrenjc93/685/orig 2025-12-04T09:43:32.2278739Z * [new branch] gh/bobrenjc93/686/base -> origin/gh/bobrenjc93/686/base 2025-12-04T09:43:32.2281506Z * [new branch] gh/bobrenjc93/686/head -> origin/gh/bobrenjc93/686/head 2025-12-04T09:43:32.2282537Z * [new branch] gh/bobrenjc93/686/orig -> origin/gh/bobrenjc93/686/orig 2025-12-04T09:43:32.2284163Z * [new branch] gh/bobrenjc93/687/base -> origin/gh/bobrenjc93/687/base 2025-12-04T09:43:32.2286203Z * [new branch] gh/bobrenjc93/687/head -> origin/gh/bobrenjc93/687/head 2025-12-04T09:43:32.2287804Z * [new branch] gh/bobrenjc93/687/orig -> origin/gh/bobrenjc93/687/orig 2025-12-04T09:43:32.2290596Z * [new branch] gh/bobrenjc93/688/base -> origin/gh/bobrenjc93/688/base 2025-12-04T09:43:32.2292403Z * [new branch] gh/bobrenjc93/688/head -> origin/gh/bobrenjc93/688/head 2025-12-04T09:43:32.2294115Z * [new branch] gh/bobrenjc93/688/orig -> origin/gh/bobrenjc93/688/orig 2025-12-04T09:43:32.2296470Z * [new branch] gh/bobrenjc93/689/base -> origin/gh/bobrenjc93/689/base 2025-12-04T09:43:32.2298237Z * [new branch] gh/bobrenjc93/689/head -> origin/gh/bobrenjc93/689/head 2025-12-04T09:43:32.2299976Z * [new branch] gh/bobrenjc93/689/orig -> origin/gh/bobrenjc93/689/orig 2025-12-04T09:43:32.2302126Z * [new branch] gh/bobrenjc93/690/base -> origin/gh/bobrenjc93/690/base 2025-12-04T09:43:32.2303873Z * [new branch] gh/bobrenjc93/690/head -> origin/gh/bobrenjc93/690/head 2025-12-04T09:43:32.2305568Z * [new branch] gh/bobrenjc93/690/orig -> origin/gh/bobrenjc93/690/orig 2025-12-04T09:43:32.2308675Z * [new branch] gh/bobrenjc93/691/base -> origin/gh/bobrenjc93/691/base 2025-12-04T09:43:32.2310619Z * [new branch] gh/bobrenjc93/691/head -> origin/gh/bobrenjc93/691/head 2025-12-04T09:43:32.2312612Z * [new branch] gh/bobrenjc93/691/orig -> origin/gh/bobrenjc93/691/orig 2025-12-04T09:43:32.2315522Z * [new branch] gh/bobrenjc93/692/base -> origin/gh/bobrenjc93/692/base 2025-12-04T09:43:32.2317345Z * [new 
branch] gh/bobrenjc93/692/head -> origin/gh/bobrenjc93/692/head 2025-12-04T09:43:32.2319090Z * [new branch] gh/bobrenjc93/692/orig -> origin/gh/bobrenjc93/692/orig 2025-12-04T09:43:32.2321264Z * [new branch] gh/bobrenjc93/693/base -> origin/gh/bobrenjc93/693/base 2025-12-04T09:43:32.2323034Z * [new branch] gh/bobrenjc93/693/head -> origin/gh/bobrenjc93/693/head 2025-12-04T09:43:32.2324783Z * [new branch] gh/bobrenjc93/693/orig -> origin/gh/bobrenjc93/693/orig 2025-12-04T09:43:32.2327240Z * [new branch] gh/bobrenjc93/694/base -> origin/gh/bobrenjc93/694/base 2025-12-04T09:43:32.2328961Z * [new branch] gh/bobrenjc93/694/head -> origin/gh/bobrenjc93/694/head 2025-12-04T09:43:32.2330741Z * [new branch] gh/bobrenjc93/694/orig -> origin/gh/bobrenjc93/694/orig 2025-12-04T09:43:32.2333030Z * [new branch] gh/bobrenjc93/695/base -> origin/gh/bobrenjc93/695/base 2025-12-04T09:43:32.2334718Z * [new branch] gh/bobrenjc93/695/head -> origin/gh/bobrenjc93/695/head 2025-12-04T09:43:32.2336425Z * [new branch] gh/bobrenjc93/695/orig -> origin/gh/bobrenjc93/695/orig 2025-12-04T09:43:32.2339433Z * [new branch] gh/c00w/23/base -> origin/gh/c00w/23/base 2025-12-04T09:43:32.2341148Z * [new branch] gh/c00w/23/head -> origin/gh/c00w/23/head 2025-12-04T09:43:32.2343629Z * [new branch] gh/c00w/53/base -> origin/gh/c00w/53/base 2025-12-04T09:43:32.2345305Z * [new branch] gh/c00w/53/head -> origin/gh/c00w/53/head 2025-12-04T09:43:32.2347130Z * [new branch] gh/c00w/53/orig -> origin/gh/c00w/53/orig 2025-12-04T09:43:32.2349416Z * [new branch] gh/c00w/54/base -> origin/gh/c00w/54/base 2025-12-04T09:43:32.2351151Z * [new branch] gh/c00w/54/head -> origin/gh/c00w/54/head 2025-12-04T09:43:32.2352847Z * [new branch] gh/c00w/54/orig -> origin/gh/c00w/54/orig 2025-12-04T09:43:32.2355330Z * [new branch] gh/c00w/56/base -> origin/gh/c00w/56/base 2025-12-04T09:43:32.2357324Z * [new branch] gh/c00w/56/head -> origin/gh/c00w/56/head 2025-12-04T09:43:32.2358880Z * [new branch] gh/c00w/56/orig -> 
origin/gh/c00w/56/orig 2025-12-04T09:43:32.2361187Z * [new branch] gh/c00w/57/base -> origin/gh/c00w/57/base 2025-12-04T09:43:32.2363008Z * [new branch] gh/c00w/57/head -> origin/gh/c00w/57/head 2025-12-04T09:43:32.2364747Z * [new branch] gh/c00w/57/orig -> origin/gh/c00w/57/orig 2025-12-04T09:43:32.2367041Z * [new branch] gh/c00w/58/base -> origin/gh/c00w/58/base 2025-12-04T09:43:32.2368729Z * [new branch] gh/c00w/58/head -> origin/gh/c00w/58/head 2025-12-04T09:43:32.2370478Z * [new branch] gh/c00w/58/orig -> origin/gh/c00w/58/orig 2025-12-04T09:43:32.2373372Z * [new branch] gh/clee2000/1/base -> origin/gh/clee2000/1/base 2025-12-04T09:43:32.2375151Z * [new branch] gh/clee2000/1/head -> origin/gh/clee2000/1/head 2025-12-04T09:43:32.2376911Z * [new branch] gh/clee2000/1/orig -> origin/gh/clee2000/1/orig 2025-12-04T09:43:32.2379833Z * [new branch] gh/coconutruben/1/base -> origin/gh/coconutruben/1/base 2025-12-04T09:43:32.2381799Z * [new branch] gh/coconutruben/1/head -> origin/gh/coconutruben/1/head 2025-12-04T09:43:32.2384491Z * [new branch] gh/coconutruben/55/base -> origin/gh/coconutruben/55/base 2025-12-04T09:43:32.2386216Z * [new branch] gh/coconutruben/55/head -> origin/gh/coconutruben/55/head 2025-12-04T09:43:32.2388074Z * [new branch] gh/coconutruben/55/orig -> origin/gh/coconutruben/55/orig 2025-12-04T09:43:32.2390589Z * [new branch] gh/coconutruben/57/base -> origin/gh/coconutruben/57/base 2025-12-04T09:43:32.2392419Z * [new branch] gh/coconutruben/57/head -> origin/gh/coconutruben/57/head 2025-12-04T09:43:32.2394192Z * [new branch] gh/coconutruben/57/orig -> origin/gh/coconutruben/57/orig 2025-12-04T09:43:32.2396531Z * [new branch] gh/coconutruben/70/base -> origin/gh/coconutruben/70/base 2025-12-04T09:43:32.2398317Z * [new branch] gh/coconutruben/70/head -> origin/gh/coconutruben/70/head 2025-12-04T09:43:32.2400156Z * [new branch] gh/coconutruben/70/orig -> origin/gh/coconutruben/70/orig 2025-12-04T09:43:32.2402414Z * [new branch] 
gh/coconutruben/71/base -> origin/gh/coconutruben/71/base 2025-12-04T09:43:32.2404121Z * [new branch] gh/coconutruben/71/head -> origin/gh/coconutruben/71/head 2025-12-04T09:43:32.2406127Z * [new branch] gh/coconutruben/71/orig -> origin/gh/coconutruben/71/orig 2025-12-04T09:43:32.2408311Z * [new branch] gh/coconutruben/72/base -> origin/gh/coconutruben/72/base 2025-12-04T09:43:32.2409990Z * [new branch] gh/coconutruben/72/head -> origin/gh/coconutruben/72/head 2025-12-04T09:43:32.2411721Z * [new branch] gh/coconutruben/72/orig -> origin/gh/coconutruben/72/orig 2025-12-04T09:43:32.2413863Z * [new branch] gh/coconutruben/73/base -> origin/gh/coconutruben/73/base 2025-12-04T09:43:32.2415597Z * [new branch] gh/coconutruben/73/head -> origin/gh/coconutruben/73/head 2025-12-04T09:43:32.2417326Z * [new branch] gh/coconutruben/73/orig -> origin/gh/coconutruben/73/orig 2025-12-04T09:43:32.2419799Z * [new branch] gh/coconutruben/74/base -> origin/gh/coconutruben/74/base 2025-12-04T09:43:32.2421630Z * [new branch] gh/coconutruben/74/head -> origin/gh/coconutruben/74/head 2025-12-04T09:43:32.2423384Z * [new branch] gh/coconutruben/74/orig -> origin/gh/coconutruben/74/orig 2025-12-04T09:43:32.2425873Z * [new branch] gh/coconutruben/79/base -> origin/gh/coconutruben/79/base 2025-12-04T09:43:32.2427853Z * [new branch] gh/coconutruben/79/head -> origin/gh/coconutruben/79/head 2025-12-04T09:43:32.2429632Z * [new branch] gh/coconutruben/79/orig -> origin/gh/coconutruben/79/orig 2025-12-04T09:43:32.2432312Z * [new branch] gh/coconutruben/80/base -> origin/gh/coconutruben/80/base 2025-12-04T09:43:32.2433867Z * [new branch] gh/coconutruben/80/head -> origin/gh/coconutruben/80/head 2025-12-04T09:43:32.2435646Z * [new branch] gh/coconutruben/80/orig -> origin/gh/coconutruben/80/orig 2025-12-04T09:43:32.2438100Z * [new branch] gh/coconutruben/82/base -> origin/gh/coconutruben/82/base 2025-12-04T09:43:32.2439781Z * [new branch] gh/coconutruben/82/head -> origin/gh/coconutruben/82/head 
2025-12-04T09:43:32.2441445Z * [new branch] gh/coconutruben/82/orig -> origin/gh/coconutruben/82/orig 2025-12-04T09:43:32.2443868Z * [new branch] gh/coconutruben/83/base -> origin/gh/coconutruben/83/base 2025-12-04T09:43:32.2445623Z * [new branch] gh/coconutruben/83/head -> origin/gh/coconutruben/83/head 2025-12-04T09:43:32.2447240Z * [new branch] gh/coconutruben/83/orig -> origin/gh/coconutruben/83/orig 2025-12-04T09:43:32.2449717Z * [new branch] gh/coconutruben/84/base -> origin/gh/coconutruben/84/base 2025-12-04T09:43:32.2451530Z * [new branch] gh/coconutruben/84/head -> origin/gh/coconutruben/84/head 2025-12-04T09:43:32.2453296Z * [new branch] gh/coconutruben/84/orig -> origin/gh/coconutruben/84/orig 2025-12-04T09:43:32.2455753Z * [new branch] gh/coconutruben/85/base -> origin/gh/coconutruben/85/base 2025-12-04T09:43:32.2457687Z * [new branch] gh/coconutruben/85/head -> origin/gh/coconutruben/85/head 2025-12-04T09:43:32.2459480Z * [new branch] gh/coconutruben/85/orig -> origin/gh/coconutruben/85/orig 2025-12-04T09:43:32.2461831Z * [new branch] gh/coconutruben/86/base -> origin/gh/coconutruben/86/base 2025-12-04T09:43:32.2464045Z * [new branch] gh/coconutruben/86/head -> origin/gh/coconutruben/86/head 2025-12-04T09:43:32.2465812Z * [new branch] gh/coconutruben/86/orig -> origin/gh/coconutruben/86/orig 2025-12-04T09:43:32.2468903Z * [new branch] gh/colinchan15/1/base -> origin/gh/colinchan15/1/base 2025-12-04T09:43:32.2470679Z * [new branch] gh/colinchan15/1/head -> origin/gh/colinchan15/1/head 2025-12-04T09:43:32.2472857Z * [new branch] gh/colinchan15/2/base -> origin/gh/colinchan15/2/base 2025-12-04T09:43:32.2474630Z * [new branch] gh/colinchan15/2/head -> origin/gh/colinchan15/2/head 2025-12-04T09:43:32.2476747Z * [new branch] gh/colinchan15/3/base -> origin/gh/colinchan15/3/base 2025-12-04T09:43:32.2478462Z * [new branch] gh/colinchan15/3/head -> origin/gh/colinchan15/3/head 2025-12-04T09:43:32.2480633Z * [new branch] gh/colinchan15/6/base -> 
origin/gh/colinchan15/6/base 2025-12-04T09:43:32.2482261Z * [new branch] gh/colinchan15/6/head -> origin/gh/colinchan15/6/head 2025-12-04T09:43:32.2485051Z * [new branch] gh/d4l3k/1/base -> origin/gh/d4l3k/1/base 2025-12-04T09:43:32.2486786Z * [new branch] gh/d4l3k/1/head -> origin/gh/d4l3k/1/head 2025-12-04T09:43:32.2489099Z * [new branch] gh/d4l3k/2/base -> origin/gh/d4l3k/2/base 2025-12-04T09:43:32.2490884Z * [new branch] gh/d4l3k/2/head -> origin/gh/d4l3k/2/head 2025-12-04T09:43:32.2492575Z * [new branch] gh/d4l3k/2/orig -> origin/gh/d4l3k/2/orig 2025-12-04T09:43:32.2494918Z * [new branch] gh/d4l3k/3/base -> origin/gh/d4l3k/3/base 2025-12-04T09:43:32.2496715Z * [new branch] gh/d4l3k/3/head -> origin/gh/d4l3k/3/head 2025-12-04T09:43:32.2498558Z * [new branch] gh/d4l3k/3/orig -> origin/gh/d4l3k/3/orig 2025-12-04T09:43:32.2501004Z * [new branch] gh/d4l3k/4/base -> origin/gh/d4l3k/4/base 2025-12-04T09:43:32.2502732Z * [new branch] gh/d4l3k/4/head -> origin/gh/d4l3k/4/head 2025-12-04T09:43:32.2504383Z * [new branch] gh/d4l3k/4/orig -> origin/gh/d4l3k/4/orig 2025-12-04T09:43:32.2506679Z * [new branch] gh/d4l3k/5/base -> origin/gh/d4l3k/5/base 2025-12-04T09:43:32.2508640Z * [new branch] gh/d4l3k/5/orig -> origin/gh/d4l3k/5/orig 2025-12-04T09:43:32.2511413Z * [new branch] gh/davidberard98/392/base -> origin/gh/davidberard98/392/base 2025-12-04T09:43:32.2513120Z * [new branch] gh/davidberard98/392/head -> origin/gh/davidberard98/392/head 2025-12-04T09:43:32.2514829Z * [new branch] gh/davidberard98/392/orig -> origin/gh/davidberard98/392/orig 2025-12-04T09:43:32.2517268Z * [new branch] gh/davidberard98/399/base -> origin/gh/davidberard98/399/base 2025-12-04T09:43:32.2519147Z * [new branch] gh/davidberard98/399/head -> origin/gh/davidberard98/399/head 2025-12-04T09:43:32.2520905Z * [new branch] gh/davidberard98/399/orig -> origin/gh/davidberard98/399/orig 2025-12-04T09:43:32.2523692Z * [new branch] gh/desertfire/605/base -> origin/gh/desertfire/605/base 
2025-12-04T09:43:32.2525420Z * [new branch] gh/desertfire/605/head -> origin/gh/desertfire/605/head 2025-12-04T09:43:32.2527187Z * [new branch] gh/desertfire/605/orig -> origin/gh/desertfire/605/orig 2025-12-04T09:43:32.2529463Z * [new branch] gh/desertfire/606/base -> origin/gh/desertfire/606/base 2025-12-04T09:43:32.2531295Z * [new branch] gh/desertfire/606/head -> origin/gh/desertfire/606/head 2025-12-04T09:43:32.2533124Z * [new branch] gh/desertfire/606/orig -> origin/gh/desertfire/606/orig 2025-12-04T09:43:32.2535414Z * [new branch] gh/desertfire/607/base -> origin/gh/desertfire/607/base 2025-12-04T09:43:32.2537196Z * [new branch] gh/desertfire/607/head -> origin/gh/desertfire/607/head 2025-12-04T09:43:32.2538904Z * [new branch] gh/desertfire/607/orig -> origin/gh/desertfire/607/orig 2025-12-04T09:43:32.2541216Z * [new branch] gh/desertfire/608/base -> origin/gh/desertfire/608/base 2025-12-04T09:43:32.2542889Z * [new branch] gh/desertfire/608/head -> origin/gh/desertfire/608/head 2025-12-04T09:43:32.2544711Z * [new branch] gh/desertfire/608/orig -> origin/gh/desertfire/608/orig 2025-12-04T09:43:32.2547063Z * [new branch] gh/desertfire/609/base -> origin/gh/desertfire/609/base 2025-12-04T09:43:32.2548938Z * [new branch] gh/desertfire/609/head -> origin/gh/desertfire/609/head 2025-12-04T09:43:32.2550665Z * [new branch] gh/desertfire/609/orig -> origin/gh/desertfire/609/orig 2025-12-04T09:43:32.2553143Z * [new branch] gh/desertfire/610/base -> origin/gh/desertfire/610/base 2025-12-04T09:43:32.2554843Z * [new branch] gh/desertfire/610/head -> origin/gh/desertfire/610/head 2025-12-04T09:43:32.2557854Z * [new branch] gh/desertfire/610/orig -> origin/gh/desertfire/610/orig 2025-12-04T09:43:32.2560125Z * [new branch] gh/desertfire/611/base -> origin/gh/desertfire/611/base 2025-12-04T09:43:32.2561868Z * [new branch] gh/desertfire/611/head -> origin/gh/desertfire/611/head 2025-12-04T09:43:32.2563622Z * [new branch] gh/desertfire/611/orig -> origin/gh/desertfire/611/orig 
2025-12-04T09:43:32.2565989Z * [new branch] gh/desertfire/612/base -> origin/gh/desertfire/612/base 2025-12-04T09:43:32.2567789Z * [new branch] gh/desertfire/612/head -> origin/gh/desertfire/612/head 2025-12-04T09:43:32.2569484Z * [new branch] gh/desertfire/612/orig -> origin/gh/desertfire/612/orig 2025-12-04T09:43:32.2571780Z * [new branch] gh/desertfire/613/base -> origin/gh/desertfire/613/base 2025-12-04T09:43:32.2573546Z * [new branch] gh/desertfire/613/head -> origin/gh/desertfire/613/head 2025-12-04T09:43:32.2575372Z * [new branch] gh/desertfire/613/orig -> origin/gh/desertfire/613/orig 2025-12-04T09:43:32.2577788Z * [new branch] gh/desertfire/614/base -> origin/gh/desertfire/614/base 2025-12-04T09:43:32.2579639Z * [new branch] gh/desertfire/614/head -> origin/gh/desertfire/614/head 2025-12-04T09:43:32.2581408Z * [new branch] gh/desertfire/614/orig -> origin/gh/desertfire/614/orig 2025-12-04T09:43:32.2583741Z * [new branch] gh/desertfire/615/base -> origin/gh/desertfire/615/base 2025-12-04T09:43:32.2585667Z * [new branch] gh/desertfire/615/head -> origin/gh/desertfire/615/head 2025-12-04T09:43:32.2587655Z * [new branch] gh/desertfire/615/orig -> origin/gh/desertfire/615/orig 2025-12-04T09:43:32.2589880Z * [new branch] gh/desertfire/616/base -> origin/gh/desertfire/616/base 2025-12-04T09:43:32.2591726Z * [new branch] gh/desertfire/616/head -> origin/gh/desertfire/616/head 2025-12-04T09:43:32.2593564Z * [new branch] gh/desertfire/616/orig -> origin/gh/desertfire/616/orig 2025-12-04T09:43:32.2595741Z * [new branch] gh/desertfire/617/base -> origin/gh/desertfire/617/base 2025-12-04T09:43:32.2597564Z * [new branch] gh/desertfire/617/head -> origin/gh/desertfire/617/head 2025-12-04T09:43:32.2599172Z * [new branch] gh/desertfire/617/orig -> origin/gh/desertfire/617/orig 2025-12-04T09:43:32.2601981Z * [new branch] gh/dharakk/1/base -> origin/gh/dharakk/1/base 2025-12-04T09:43:32.2603778Z * [new branch] gh/dharakk/1/head -> origin/gh/dharakk/1/head 
2025-12-04T09:43:32.2606636Z * [new branch] gh/drisspg/170/base -> origin/gh/drisspg/170/base 2025-12-04T09:43:32.2608347Z * [new branch] gh/drisspg/170/head -> origin/gh/drisspg/170/head 2025-12-04T09:43:32.2610086Z * [new branch] gh/drisspg/170/orig -> origin/gh/drisspg/170/orig 2025-12-04T09:43:32.2612452Z * [new branch] gh/drisspg/182/base -> origin/gh/drisspg/182/base 2025-12-04T09:43:32.2614256Z * [new branch] gh/drisspg/182/head -> origin/gh/drisspg/182/head 2025-12-04T09:43:32.2616528Z * [new branch] gh/drisspg/183/base -> origin/gh/drisspg/183/base 2025-12-04T09:43:32.2618112Z * [new branch] gh/drisspg/183/head -> origin/gh/drisspg/183/head 2025-12-04T09:43:32.2620293Z * [new branch] gh/drisspg/184/base -> origin/gh/drisspg/184/base 2025-12-04T09:43:32.2621954Z * [new branch] gh/drisspg/184/head -> origin/gh/drisspg/184/head 2025-12-04T09:43:32.2624240Z * [new branch] gh/drisspg/185/base -> origin/gh/drisspg/185/base 2025-12-04T09:43:32.2625982Z * [new branch] gh/drisspg/185/head -> origin/gh/drisspg/185/head 2025-12-04T09:43:32.2628457Z * [new branch] gh/drisspg/194/base -> origin/gh/drisspg/194/base 2025-12-04T09:43:32.2630453Z * [new branch] gh/drisspg/194/head -> origin/gh/drisspg/194/head 2025-12-04T09:43:32.2632213Z * [new branch] gh/drisspg/194/orig -> origin/gh/drisspg/194/orig 2025-12-04T09:43:32.2634520Z * [new branch] gh/drisspg/200/base -> origin/gh/drisspg/200/base 2025-12-04T09:43:32.2636278Z * [new branch] gh/drisspg/200/head -> origin/gh/drisspg/200/head 2025-12-04T09:43:32.2638120Z * [new branch] gh/drisspg/200/orig -> origin/gh/drisspg/200/orig 2025-12-04T09:43:32.2640579Z * [new branch] gh/drisspg/218/base -> origin/gh/drisspg/218/base 2025-12-04T09:43:32.2642322Z * [new branch] gh/drisspg/218/head -> origin/gh/drisspg/218/head 2025-12-04T09:43:32.2644000Z * [new branch] gh/drisspg/218/orig -> origin/gh/drisspg/218/orig 2025-12-04T09:43:32.2646319Z * [new branch] gh/drisspg/219/base -> origin/gh/drisspg/219/base 
2025-12-04T09:43:32.2648032Z * [new branch] gh/drisspg/219/head -> origin/gh/drisspg/219/head 2025-12-04T09:43:32.2649855Z * [new branch] gh/drisspg/219/orig -> origin/gh/drisspg/219/orig 2025-12-04T09:43:32.2652077Z * [new branch] gh/drisspg/220/base -> origin/gh/drisspg/220/base 2025-12-04T09:43:32.2653812Z * [new branch] gh/drisspg/220/head -> origin/gh/drisspg/220/head 2025-12-04T09:43:32.2655667Z * [new branch] gh/drisspg/220/orig -> origin/gh/drisspg/220/orig 2025-12-04T09:43:32.2658088Z * [new branch] gh/drisspg/221/base -> origin/gh/drisspg/221/base 2025-12-04T09:43:32.2659889Z * [new branch] gh/drisspg/221/head -> origin/gh/drisspg/221/head 2025-12-04T09:43:32.2661560Z * [new branch] gh/drisspg/221/orig -> origin/gh/drisspg/221/orig 2025-12-04T09:43:32.2663926Z * [new branch] gh/drisspg/222/base -> origin/gh/drisspg/222/base 2025-12-04T09:43:32.2665674Z * [new branch] gh/drisspg/222/head -> origin/gh/drisspg/222/head 2025-12-04T09:43:32.2667446Z * [new branch] gh/drisspg/222/orig -> origin/gh/drisspg/222/orig 2025-12-04T09:43:32.2669850Z * [new branch] gh/drisspg/223/base -> origin/gh/drisspg/223/base 2025-12-04T09:43:32.2671563Z * [new branch] gh/drisspg/223/head -> origin/gh/drisspg/223/head 2025-12-04T09:43:32.2673287Z * [new branch] gh/drisspg/223/orig -> origin/gh/drisspg/223/orig 2025-12-04T09:43:32.2675585Z * [new branch] gh/drisspg/224/base -> origin/gh/drisspg/224/base 2025-12-04T09:43:32.2677336Z * [new branch] gh/drisspg/224/head -> origin/gh/drisspg/224/head 2025-12-04T09:43:32.2679036Z * [new branch] gh/drisspg/224/orig -> origin/gh/drisspg/224/orig 2025-12-04T09:43:32.2681356Z * [new branch] gh/drisspg/225/base -> origin/gh/drisspg/225/base 2025-12-04T09:43:32.2683184Z * [new branch] gh/drisspg/225/head -> origin/gh/drisspg/225/head 2025-12-04T09:43:32.2684854Z * [new branch] gh/drisspg/225/orig -> origin/gh/drisspg/225/orig 2025-12-04T09:43:32.2687155Z * [new branch] gh/drisspg/226/base -> origin/gh/drisspg/226/base 
2025-12-04T09:43:32.2688883Z * [new branch] gh/drisspg/226/head -> origin/gh/drisspg/226/head 2025-12-04T09:43:32.2690535Z * [new branch] gh/drisspg/226/orig -> origin/gh/drisspg/226/orig 2025-12-04T09:43:32.2693905Z * [new branch] gh/drisspg/227/base -> origin/gh/drisspg/227/base 2025-12-04T09:43:32.2695628Z * [new branch] gh/drisspg/227/head -> origin/gh/drisspg/227/head 2025-12-04T09:43:32.2697376Z * [new branch] gh/drisspg/227/orig -> origin/gh/drisspg/227/orig 2025-12-04T09:43:32.2699802Z * [new branch] gh/drisspg/228/base -> origin/gh/drisspg/228/base 2025-12-04T09:43:32.2701500Z * [new branch] gh/drisspg/228/head -> origin/gh/drisspg/228/head 2025-12-04T09:43:32.2703162Z * [new branch] gh/drisspg/228/orig -> origin/gh/drisspg/228/orig 2025-12-04T09:43:32.2705557Z * [new branch] gh/drisspg/229/base -> origin/gh/drisspg/229/base 2025-12-04T09:43:32.2707399Z * [new branch] gh/drisspg/229/head -> origin/gh/drisspg/229/head 2025-12-04T09:43:32.2709317Z * [new branch] gh/drisspg/229/orig -> origin/gh/drisspg/229/orig 2025-12-04T09:43:32.2711677Z * [new branch] gh/drisspg/230/base -> origin/gh/drisspg/230/base 2025-12-04T09:43:32.2713394Z * [new branch] gh/drisspg/230/head -> origin/gh/drisspg/230/head 2025-12-04T09:43:32.2715144Z * [new branch] gh/drisspg/230/orig -> origin/gh/drisspg/230/orig 2025-12-04T09:43:32.2718116Z * [new branch] gh/dsjohns2/1/base -> origin/gh/dsjohns2/1/base 2025-12-04T09:43:32.2719859Z * [new branch] gh/dsjohns2/1/head -> origin/gh/dsjohns2/1/head 2025-12-04T09:43:32.2722735Z * [new branch] gh/dzmitry-huba/1/base -> origin/gh/dzmitry-huba/1/base 2025-12-04T09:43:32.2724512Z * [new branch] gh/dzmitry-huba/1/head -> origin/gh/dzmitry-huba/1/head 2025-12-04T09:43:32.2726953Z * [new branch] gh/dzmitry-huba/12/base -> origin/gh/dzmitry-huba/12/base 2025-12-04T09:43:32.2728806Z * [new branch] gh/dzmitry-huba/12/head -> origin/gh/dzmitry-huba/12/head 2025-12-04T09:43:32.2730686Z * [new branch] gh/dzmitry-huba/12/orig -> 
origin/gh/dzmitry-huba/12/orig 2025-12-04T09:43:32.2733086Z * [new branch] gh/dzmitry-huba/13/base -> origin/gh/dzmitry-huba/13/base 2025-12-04T09:43:32.2746738Z * [new branch] gh/dzmitry-huba/13/head -> origin/gh/dzmitry-huba/13/head 2025-12-04T09:43:32.2746962Z * [new branch] gh/dzmitry-huba/13/orig -> origin/gh/dzmitry-huba/13/orig 2025-12-04T09:43:32.2747138Z * [new branch] gh/dzmitry-huba/14/base -> origin/gh/dzmitry-huba/14/base 2025-12-04T09:43:32.2747389Z * [new branch] gh/dzmitry-huba/14/head -> origin/gh/dzmitry-huba/14/head 2025-12-04T09:43:32.2747552Z * [new branch] gh/dzmitry-huba/14/orig -> origin/gh/dzmitry-huba/14/orig 2025-12-04T09:43:32.2747717Z * [new branch] gh/dzmitry-huba/15/base -> origin/gh/dzmitry-huba/15/base 2025-12-04T09:43:32.2747876Z * [new branch] gh/dzmitry-huba/15/head -> origin/gh/dzmitry-huba/15/head 2025-12-04T09:43:32.2748645Z * [new branch] gh/dzmitry-huba/15/orig -> origin/gh/dzmitry-huba/15/orig 2025-12-04T09:43:32.2751330Z * [new branch] gh/dzmitry-huba/16/base -> origin/gh/dzmitry-huba/16/base 2025-12-04T09:43:32.2754126Z * [new branch] gh/dzmitry-huba/16/head -> origin/gh/dzmitry-huba/16/head 2025-12-04T09:43:32.2756019Z * [new branch] gh/dzmitry-huba/16/orig -> origin/gh/dzmitry-huba/16/orig 2025-12-04T09:43:32.2758506Z * [new branch] gh/dzmitry-huba/17/base -> origin/gh/dzmitry-huba/17/base 2025-12-04T09:43:32.2760029Z * [new branch] gh/dzmitry-huba/17/head -> origin/gh/dzmitry-huba/17/head 2025-12-04T09:43:32.2761726Z * [new branch] gh/dzmitry-huba/17/orig -> origin/gh/dzmitry-huba/17/orig 2025-12-04T09:43:32.2763972Z * [new branch] gh/dzmitry-huba/2/base -> origin/gh/dzmitry-huba/2/base 2025-12-04T09:43:32.2765615Z * [new branch] gh/dzmitry-huba/2/head -> origin/gh/dzmitry-huba/2/head 2025-12-04T09:43:32.2767790Z * [new branch] gh/dzmitry-huba/3/base -> origin/gh/dzmitry-huba/3/base 2025-12-04T09:43:32.2769459Z * [new branch] gh/dzmitry-huba/3/head -> origin/gh/dzmitry-huba/3/head 2025-12-04T09:43:32.2772345Z * [new 
branch] gh/eellison/808/base -> origin/gh/eellison/808/base 2025-12-04T09:43:32.2774086Z * [new branch] gh/eellison/808/head -> origin/gh/eellison/808/head 2025-12-04T09:43:32.2775989Z * [new branch] gh/eellison/808/orig -> origin/gh/eellison/808/orig 2025-12-04T09:43:32.2778511Z * [new branch] gh/eellison/822/base -> origin/gh/eellison/822/base 2025-12-04T09:43:32.2780379Z * [new branch] gh/eellison/822/head -> origin/gh/eellison/822/head 2025-12-04T09:43:32.2782162Z * [new branch] gh/eellison/822/orig -> origin/gh/eellison/822/orig 2025-12-04T09:43:32.2784406Z * [new branch] gh/eellison/823/base -> origin/gh/eellison/823/base 2025-12-04T09:43:32.2786156Z * [new branch] gh/eellison/823/head -> origin/gh/eellison/823/head 2025-12-04T09:43:32.2787930Z * [new branch] gh/eellison/823/orig -> origin/gh/eellison/823/orig 2025-12-04T09:43:32.2790289Z * [new branch] gh/eellison/862/base -> origin/gh/eellison/862/base 2025-12-04T09:43:32.2792044Z * [new branch] gh/eellison/862/head -> origin/gh/eellison/862/head 2025-12-04T09:43:32.2793729Z * [new branch] gh/eellison/862/orig -> origin/gh/eellison/862/orig 2025-12-04T09:43:32.2796263Z * [new branch] gh/eellison/863/base -> origin/gh/eellison/863/base 2025-12-04T09:43:32.2797990Z * [new branch] gh/eellison/863/head -> origin/gh/eellison/863/head 2025-12-04T09:43:32.2799854Z * [new branch] gh/eellison/863/orig -> origin/gh/eellison/863/orig 2025-12-04T09:43:32.2802090Z * [new branch] gh/eellison/864/base -> origin/gh/eellison/864/base 2025-12-04T09:43:32.2804165Z * [new branch] gh/eellison/864/head -> origin/gh/eellison/864/head 2025-12-04T09:43:32.2805726Z * [new branch] gh/eellison/864/orig -> origin/gh/eellison/864/orig 2025-12-04T09:43:32.2808256Z * [new branch] gh/eellison/865/base -> origin/gh/eellison/865/base 2025-12-04T09:43:32.2809898Z * [new branch] gh/eellison/865/head -> origin/gh/eellison/865/head 2025-12-04T09:43:32.2811556Z * [new branch] gh/eellison/865/orig -> origin/gh/eellison/865/orig 
2025-12-04T09:43:32.2813823Z * [new branch] gh/eellison/866/base -> origin/gh/eellison/866/base 2025-12-04T09:43:32.2815592Z * [new branch] gh/eellison/866/head -> origin/gh/eellison/866/head 2025-12-04T09:43:32.2817265Z * [new branch] gh/eellison/866/orig -> origin/gh/eellison/866/orig 2025-12-04T09:43:32.2819747Z * [new branch] gh/eellison/867/base -> origin/gh/eellison/867/base 2025-12-04T09:43:32.2821389Z * [new branch] gh/eellison/867/head -> origin/gh/eellison/867/head 2025-12-04T09:43:32.2823185Z * [new branch] gh/eellison/867/orig -> origin/gh/eellison/867/orig 2025-12-04T09:43:32.2825700Z * [new branch] gh/eellison/868/base -> origin/gh/eellison/868/base 2025-12-04T09:43:32.2827698Z * [new branch] gh/eellison/868/head -> origin/gh/eellison/868/head 2025-12-04T09:43:32.2829503Z * [new branch] gh/eellison/868/orig -> origin/gh/eellison/868/orig 2025-12-04T09:43:32.2831833Z * [new branch] gh/eellison/869/base -> origin/gh/eellison/869/base 2025-12-04T09:43:32.2833962Z * [new branch] gh/eellison/869/head -> origin/gh/eellison/869/head 2025-12-04T09:43:32.2835717Z * [new branch] gh/eellison/869/orig -> origin/gh/eellison/869/orig 2025-12-04T09:43:32.2838069Z * [new branch] gh/eellison/870/base -> origin/gh/eellison/870/base 2025-12-04T09:43:32.2839723Z * [new branch] gh/eellison/870/head -> origin/gh/eellison/870/head 2025-12-04T09:43:32.2841443Z * [new branch] gh/eellison/870/orig -> origin/gh/eellison/870/orig 2025-12-04T09:43:32.2844236Z * [new branch] gh/eellison/871/base -> origin/gh/eellison/871/base 2025-12-04T09:43:32.2845466Z * [new branch] gh/eellison/871/head -> origin/gh/eellison/871/head 2025-12-04T09:43:32.2847351Z * [new branch] gh/eellison/871/orig -> origin/gh/eellison/871/orig 2025-12-04T09:43:32.2849821Z * [new branch] gh/eellison/872/base -> origin/gh/eellison/872/base 2025-12-04T09:43:32.2851500Z * [new branch] gh/eellison/872/head -> origin/gh/eellison/872/head 2025-12-04T09:43:32.2853136Z * [new branch] gh/eellison/872/orig -> 
origin/gh/eellison/872/orig 2025-12-04T09:43:32.2855608Z * [new branch] gh/eellison/873/base -> origin/gh/eellison/873/base 2025-12-04T09:43:32.2857382Z * [new branch] gh/eellison/873/head -> origin/gh/eellison/873/head 2025-12-04T09:43:32.2859183Z * [new branch] gh/eellison/873/orig -> origin/gh/eellison/873/orig 2025-12-04T09:43:32.2861474Z * [new branch] gh/eellison/874/base -> origin/gh/eellison/874/base 2025-12-04T09:43:32.2863190Z * [new branch] gh/eellison/874/head -> origin/gh/eellison/874/head 2025-12-04T09:43:32.2864889Z * [new branch] gh/eellison/874/orig -> origin/gh/eellison/874/orig 2025-12-04T09:43:32.2867884Z * [new branch] gh/eellison/875/base -> origin/gh/eellison/875/base 2025-12-04T09:43:32.2869712Z * [new branch] gh/eellison/875/head -> origin/gh/eellison/875/head 2025-12-04T09:43:32.2871528Z * [new branch] gh/eellison/875/orig -> origin/gh/eellison/875/orig 2025-12-04T09:43:32.2873937Z * [new branch] gh/eellison/876/base -> origin/gh/eellison/876/base 2025-12-04T09:43:32.2875789Z * [new branch] gh/eellison/876/head -> origin/gh/eellison/876/head 2025-12-04T09:43:32.2877530Z * [new branch] gh/eellison/876/orig -> origin/gh/eellison/876/orig 2025-12-04T09:43:32.2879867Z * [new branch] gh/eellison/877/base -> origin/gh/eellison/877/base 2025-12-04T09:43:32.2881573Z * [new branch] gh/eellison/877/head -> origin/gh/eellison/877/head 2025-12-04T09:43:32.2883334Z * [new branch] gh/eellison/877/orig -> origin/gh/eellison/877/orig 2025-12-04T09:43:32.2885734Z * [new branch] gh/eellison/878/base -> origin/gh/eellison/878/base 2025-12-04T09:43:32.2887420Z * [new branch] gh/eellison/878/head -> origin/gh/eellison/878/head 2025-12-04T09:43:32.2889161Z * [new branch] gh/eellison/878/orig -> origin/gh/eellison/878/orig 2025-12-04T09:43:32.2891492Z * [new branch] gh/eellison/879/base -> origin/gh/eellison/879/base 2025-12-04T09:43:32.2893256Z * [new branch] gh/eellison/879/head -> origin/gh/eellison/879/head 2025-12-04T09:43:32.2895052Z * [new branch] 
gh/eellison/879/orig -> origin/gh/eellison/879/orig 2025-12-04T09:43:32.2897418Z * [new branch] gh/eellison/880/base -> origin/gh/eellison/880/base 2025-12-04T09:43:32.2899123Z * [new branch] gh/eellison/880/head -> origin/gh/eellison/880/head 2025-12-04T09:43:32.2900882Z * [new branch] gh/eellison/880/orig -> origin/gh/eellison/880/orig 2025-12-04T09:43:32.2903268Z * [new branch] gh/eellison/881/base -> origin/gh/eellison/881/base 2025-12-04T09:43:32.2905037Z * [new branch] gh/eellison/881/head -> origin/gh/eellison/881/head 2025-12-04T09:43:32.2906782Z * [new branch] gh/eellison/881/orig -> origin/gh/eellison/881/orig 2025-12-04T09:43:32.2909259Z * [new branch] gh/eellison/882/base -> origin/gh/eellison/882/base 2025-12-04T09:43:32.2911025Z * [new branch] gh/eellison/882/head -> origin/gh/eellison/882/head 2025-12-04T09:43:32.2912786Z * [new branch] gh/eellison/882/orig -> origin/gh/eellison/882/orig 2025-12-04T09:43:32.2915104Z * [new branch] gh/eellison/883/base -> origin/gh/eellison/883/base 2025-12-04T09:43:32.2916891Z * [new branch] gh/eellison/883/head -> origin/gh/eellison/883/head 2025-12-04T09:43:32.2918781Z * [new branch] gh/eellison/883/orig -> origin/gh/eellison/883/orig 2025-12-04T09:43:32.2921313Z * [new branch] gh/eellison/884/base -> origin/gh/eellison/884/base 2025-12-04T09:43:32.2923164Z * [new branch] gh/eellison/884/head -> origin/gh/eellison/884/head 2025-12-04T09:43:32.2924807Z * [new branch] gh/eellison/884/orig -> origin/gh/eellison/884/orig 2025-12-04T09:43:32.2927587Z * [new branch] gh/etaf/147/base -> origin/gh/etaf/147/base 2025-12-04T09:43:32.2929372Z * [new branch] gh/etaf/147/head -> origin/gh/etaf/147/head 2025-12-04T09:43:32.2931832Z * [new branch] gh/etaf/154/base -> origin/gh/etaf/154/base 2025-12-04T09:43:32.2933516Z * [new branch] gh/etaf/154/head -> origin/gh/etaf/154/head 2025-12-04T09:43:32.2935202Z * [new branch] gh/etaf/154/orig -> origin/gh/etaf/154/orig 2025-12-04T09:43:32.2937525Z * [new branch] gh/etaf/156/base -> 
origin/gh/etaf/156/base 2025-12-04T09:43:32.2939230Z * [new branch] gh/etaf/156/head -> origin/gh/etaf/156/head 2025-12-04T09:43:32.2941010Z * [new branch] gh/etaf/156/orig -> origin/gh/etaf/156/orig 2025-12-04T09:43:32.2943503Z * [new branch] gh/etaf/157/base -> origin/gh/etaf/157/base 2025-12-04T09:43:32.2945295Z * [new branch] gh/etaf/157/head -> origin/gh/etaf/157/head 2025-12-04T09:43:32.2946986Z * [new branch] gh/etaf/157/orig -> origin/gh/etaf/157/orig 2025-12-04T09:43:32.2949469Z * [new branch] gh/etaf/158/base -> origin/gh/etaf/158/base 2025-12-04T09:43:32.2951192Z * [new branch] gh/etaf/158/head -> origin/gh/etaf/158/head 2025-12-04T09:43:32.2952886Z * [new branch] gh/etaf/158/orig -> origin/gh/etaf/158/orig 2025-12-04T09:43:32.2955154Z * [new branch] gh/etaf/159/base -> origin/gh/etaf/159/base 2025-12-04T09:43:32.2958013Z * [new branch] gh/etaf/159/head -> origin/gh/etaf/159/head 2025-12-04T09:43:32.2959734Z * [new branch] gh/etaf/159/orig -> origin/gh/etaf/159/orig 2025-12-04T09:43:32.2962035Z * [new branch] gh/etaf/160/base -> origin/gh/etaf/160/base 2025-12-04T09:43:32.2963802Z * [new branch] gh/etaf/160/head -> origin/gh/etaf/160/head 2025-12-04T09:43:32.2965603Z * [new branch] gh/etaf/160/orig -> origin/gh/etaf/160/orig 2025-12-04T09:43:32.2968055Z * [new branch] gh/etaf/161/base -> origin/gh/etaf/161/base 2025-12-04T09:43:32.2969866Z * [new branch] gh/etaf/161/head -> origin/gh/etaf/161/head 2025-12-04T09:43:32.2971569Z * [new branch] gh/etaf/161/orig -> origin/gh/etaf/161/orig 2025-12-04T09:43:32.2973925Z * [new branch] gh/etaf/166/base -> origin/gh/etaf/166/base 2025-12-04T09:43:32.2975760Z * [new branch] gh/etaf/166/head -> origin/gh/etaf/166/head 2025-12-04T09:43:32.2977489Z * [new branch] gh/etaf/166/orig -> origin/gh/etaf/166/orig 2025-12-04T09:43:32.2979777Z * [new branch] gh/etaf/167/base -> origin/gh/etaf/167/base 2025-12-04T09:43:32.2981530Z * [new branch] gh/etaf/167/head -> origin/gh/etaf/167/head 2025-12-04T09:43:32.2983223Z * [new 
branch] gh/etaf/167/orig -> origin/gh/etaf/167/orig 2025-12-04T09:43:32.2985693Z * [new branch] gh/etaf/168/base -> origin/gh/etaf/168/base 2025-12-04T09:43:32.2987500Z * [new branch] gh/etaf/168/head -> origin/gh/etaf/168/head 2025-12-04T09:43:32.2989387Z * [new branch] gh/etaf/168/orig -> origin/gh/etaf/168/orig 2025-12-04T09:43:32.2991933Z * [new branch] gh/etaf/172/base -> origin/gh/etaf/172/base 2025-12-04T09:43:32.2993636Z * [new branch] gh/etaf/172/head -> origin/gh/etaf/172/head 2025-12-04T09:43:32.2995328Z * [new branch] gh/etaf/172/orig -> origin/gh/etaf/172/orig 2025-12-04T09:43:32.2997722Z * [new branch] gh/etaf/173/base -> origin/gh/etaf/173/base 2025-12-04T09:43:32.2999588Z * [new branch] gh/etaf/173/head -> origin/gh/etaf/173/head 2025-12-04T09:43:32.3001522Z * [new branch] gh/etaf/173/orig -> origin/gh/etaf/173/orig 2025-12-04T09:43:32.3003871Z * [new branch] gh/etaf/174/base -> origin/gh/etaf/174/base 2025-12-04T09:43:32.3005579Z * [new branch] gh/etaf/174/head -> origin/gh/etaf/174/head 2025-12-04T09:43:32.3007970Z * [new branch] gh/etaf/175/base -> origin/gh/etaf/175/base 2025-12-04T09:43:32.3009675Z * [new branch] gh/etaf/175/head -> origin/gh/etaf/175/head 2025-12-04T09:43:32.3011318Z * [new branch] gh/etaf/175/orig -> origin/gh/etaf/175/orig 2025-12-04T09:43:32.3013770Z * [new branch] gh/etaf/176/base -> origin/gh/etaf/176/base 2025-12-04T09:43:32.3015641Z * [new branch] gh/etaf/176/head -> origin/gh/etaf/176/head 2025-12-04T09:43:32.3017356Z * [new branch] gh/etaf/176/orig -> origin/gh/etaf/176/orig 2025-12-04T09:43:32.3020367Z * [new branch] gh/etaf/177/base -> origin/gh/etaf/177/base 2025-12-04T09:43:32.3022201Z * [new branch] gh/etaf/177/head -> origin/gh/etaf/177/head 2025-12-04T09:43:32.3023973Z * [new branch] gh/etaf/177/orig -> origin/gh/etaf/177/orig 2025-12-04T09:43:32.3026444Z * [new branch] gh/etaf/178/base -> origin/gh/etaf/178/base 2025-12-04T09:43:32.3028550Z * [new branch] gh/etaf/178/head -> origin/gh/etaf/178/head 
2025-12-04T09:43:32.3030328Z * [new branch] gh/etaf/178/orig -> origin/gh/etaf/178/orig 2025-12-04T09:43:32.3033173Z * [new branch] gh/etaf/179/base -> origin/gh/etaf/179/base 2025-12-04T09:43:32.3034946Z * [new branch] gh/etaf/179/head -> origin/gh/etaf/179/head 2025-12-04T09:43:32.3036646Z * [new branch] gh/etaf/179/orig -> origin/gh/etaf/179/orig 2025-12-04T09:43:32.3038988Z * [new branch] gh/etaf/180/base -> origin/gh/etaf/180/base 2025-12-04T09:43:32.3040788Z * [new branch] gh/etaf/180/head -> origin/gh/etaf/180/head 2025-12-04T09:43:32.3042553Z * [new branch] gh/etaf/180/orig -> origin/gh/etaf/180/orig 2025-12-04T09:43:32.3045553Z * [new branch] gh/exclamaforte/1/base -> origin/gh/exclamaforte/1/base 2025-12-04T09:43:32.3047635Z * [new branch] gh/exclamaforte/1/head -> origin/gh/exclamaforte/1/head 2025-12-04T09:43:32.3049896Z * [new branch] gh/exclamaforte/2/base -> origin/gh/exclamaforte/2/base 2025-12-04T09:43:32.3051457Z * [new branch] gh/exclamaforte/2/head -> origin/gh/exclamaforte/2/head 2025-12-04T09:43:32.3053789Z * [new branch] gh/exclamaforte/3/base -> origin/gh/exclamaforte/3/base 2025-12-04T09:43:32.3055686Z * [new branch] gh/exclamaforte/3/head -> origin/gh/exclamaforte/3/head 2025-12-04T09:43:32.3058119Z * [new branch] gh/exclamaforte/4/base -> origin/gh/exclamaforte/4/base 2025-12-04T09:43:32.3059830Z * [new branch] gh/exclamaforte/4/head -> origin/gh/exclamaforte/4/head 2025-12-04T09:43:32.3062690Z * [new branch] gh/ezyang/2374/base -> origin/gh/ezyang/2374/base 2025-12-04T09:43:32.3064449Z * [new branch] gh/ezyang/2374/head -> origin/gh/ezyang/2374/head 2025-12-04T09:43:32.3066361Z * [new branch] gh/ezyang/2374/orig -> origin/gh/ezyang/2374/orig 2025-12-04T09:43:32.3068675Z * [new branch] gh/ezyang/2973/base -> origin/gh/ezyang/2973/base 2025-12-04T09:43:32.3070438Z * [new branch] gh/ezyang/2973/head -> origin/gh/ezyang/2973/head 2025-12-04T09:43:32.3072198Z * [new branch] gh/ezyang/2973/orig -> origin/gh/ezyang/2973/orig 
2025-12-04T09:43:32.3074513Z * [new branch] gh/ezyang/2974/base -> origin/gh/ezyang/2974/base 2025-12-04T09:43:32.3076200Z * [new branch] gh/ezyang/2974/head -> origin/gh/ezyang/2974/head 2025-12-04T09:43:32.3077877Z * [new branch] gh/ezyang/2974/orig -> origin/gh/ezyang/2974/orig 2025-12-04T09:43:32.3080232Z * [new branch] gh/ezyang/3131/base -> origin/gh/ezyang/3131/base 2025-12-04T09:43:32.3081983Z * [new branch] gh/ezyang/3131/head -> origin/gh/ezyang/3131/head 2025-12-04T09:43:32.3083716Z * [new branch] gh/ezyang/3131/orig -> origin/gh/ezyang/3131/orig 2025-12-04T09:43:32.3086027Z * [new branch] gh/ezyang/3139/base -> origin/gh/ezyang/3139/base 2025-12-04T09:43:32.3087762Z * [new branch] gh/ezyang/3139/head -> origin/gh/ezyang/3139/head 2025-12-04T09:43:32.3089519Z * [new branch] gh/ezyang/3139/orig -> origin/gh/ezyang/3139/orig 2025-12-04T09:43:32.3091786Z * [new branch] gh/ezyang/3140/base -> origin/gh/ezyang/3140/base 2025-12-04T09:43:32.3093462Z * [new branch] gh/ezyang/3140/head -> origin/gh/ezyang/3140/head 2025-12-04T09:43:32.3095306Z * [new branch] gh/ezyang/3140/orig -> origin/gh/ezyang/3140/orig 2025-12-04T09:43:32.3097631Z * [new branch] gh/ezyang/3143/base -> origin/gh/ezyang/3143/base 2025-12-04T09:43:32.3099303Z * [new branch] gh/ezyang/3143/head -> origin/gh/ezyang/3143/head 2025-12-04T09:43:32.3101038Z * [new branch] gh/ezyang/3143/orig -> origin/gh/ezyang/3143/orig 2025-12-04T09:43:32.3103431Z * [new branch] gh/ezyang/3144/base -> origin/gh/ezyang/3144/base 2025-12-04T09:43:32.3105246Z * [new branch] gh/ezyang/3144/head -> origin/gh/ezyang/3144/head 2025-12-04T09:43:32.3106903Z * [new branch] gh/ezyang/3144/orig -> origin/gh/ezyang/3144/orig 2025-12-04T09:43:32.3109379Z * [new branch] gh/ezyang/3167/base -> origin/gh/ezyang/3167/base 2025-12-04T09:43:32.3111078Z * [new branch] gh/ezyang/3167/head -> origin/gh/ezyang/3167/head 2025-12-04T09:43:32.3112766Z * [new branch] gh/ezyang/3167/orig -> origin/gh/ezyang/3167/orig 
2025-12-04T09:43:32.3115056Z * [new branch] gh/ezyang/3173/base -> origin/gh/ezyang/3173/base 2025-12-04T09:43:32.3116808Z * [new branch] gh/ezyang/3173/head -> origin/gh/ezyang/3173/head 2025-12-04T09:43:32.3118662Z * [new branch] gh/ezyang/3173/orig -> origin/gh/ezyang/3173/orig 2025-12-04T09:43:32.3121008Z * [new branch] gh/ezyang/3175/base -> origin/gh/ezyang/3175/base 2025-12-04T09:43:32.3122740Z * [new branch] gh/ezyang/3175/head -> origin/gh/ezyang/3175/head 2025-12-04T09:43:32.3124467Z * [new branch] gh/ezyang/3175/orig -> origin/gh/ezyang/3175/orig 2025-12-04T09:43:32.3126771Z * [new branch] gh/ezyang/3182/base -> origin/gh/ezyang/3182/base 2025-12-04T09:43:32.3128563Z * [new branch] gh/ezyang/3182/head -> origin/gh/ezyang/3182/head 2025-12-04T09:43:32.3130248Z * [new branch] gh/ezyang/3182/orig -> origin/gh/ezyang/3182/orig 2025-12-04T09:43:32.3132620Z * [new branch] gh/ezyang/3185/base -> origin/gh/ezyang/3185/base 2025-12-04T09:43:32.3134388Z * [new branch] gh/ezyang/3185/head -> origin/gh/ezyang/3185/head 2025-12-04T09:43:32.3136033Z * [new branch] gh/ezyang/3185/orig -> origin/gh/ezyang/3185/orig 2025-12-04T09:43:32.3138359Z * [new branch] gh/ezyang/3189/base -> origin/gh/ezyang/3189/base 2025-12-04T09:43:32.3140055Z * [new branch] gh/ezyang/3189/head -> origin/gh/ezyang/3189/head 2025-12-04T09:43:32.3141748Z * [new branch] gh/ezyang/3189/orig -> origin/gh/ezyang/3189/orig 2025-12-04T09:43:32.3144085Z * [new branch] gh/ezyang/3191/base -> origin/gh/ezyang/3191/base 2025-12-04T09:43:32.3145826Z * [new branch] gh/ezyang/3191/head -> origin/gh/ezyang/3191/head 2025-12-04T09:43:32.3147553Z * [new branch] gh/ezyang/3191/orig -> origin/gh/ezyang/3191/orig 2025-12-04T09:43:32.3150396Z * [new branch] gh/ezyang/3192/base -> origin/gh/ezyang/3192/base 2025-12-04T09:43:32.3152191Z * [new branch] gh/ezyang/3192/head -> origin/gh/ezyang/3192/head 2025-12-04T09:43:32.3153945Z * [new branch] gh/ezyang/3192/orig -> origin/gh/ezyang/3192/orig 
2025-12-04T09:43:32.3156670Z * [new branch] gh/ezyang/3193/base -> origin/gh/ezyang/3193/base 2025-12-04T09:43:32.3158376Z * [new branch] gh/ezyang/3193/head -> origin/gh/ezyang/3193/head 2025-12-04T09:43:32.3160260Z * [new branch] gh/ezyang/3193/orig -> origin/gh/ezyang/3193/orig 2025-12-04T09:43:32.3162659Z * [new branch] gh/ezyang/3194/base -> origin/gh/ezyang/3194/base 2025-12-04T09:43:32.3164338Z * [new branch] gh/ezyang/3194/head -> origin/gh/ezyang/3194/head 2025-12-04T09:43:32.3166054Z * [new branch] gh/ezyang/3194/orig -> origin/gh/ezyang/3194/orig 2025-12-04T09:43:32.3168376Z * [new branch] gh/ezyang/3195/base -> origin/gh/ezyang/3195/base 2025-12-04T09:43:32.3170082Z * [new branch] gh/ezyang/3195/head -> origin/gh/ezyang/3195/head 2025-12-04T09:43:32.3171739Z * [new branch] gh/ezyang/3195/orig -> origin/gh/ezyang/3195/orig 2025-12-04T09:43:32.3174060Z * [new branch] gh/ezyang/3196/base -> origin/gh/ezyang/3196/base 2025-12-04T09:43:32.3175905Z * [new branch] gh/ezyang/3196/head -> origin/gh/ezyang/3196/head 2025-12-04T09:43:32.3177691Z * [new branch] gh/ezyang/3196/orig -> origin/gh/ezyang/3196/orig 2025-12-04T09:43:32.3180057Z * [new branch] gh/ezyang/3197/base -> origin/gh/ezyang/3197/base 2025-12-04T09:43:32.3181733Z * [new branch] gh/ezyang/3197/head -> origin/gh/ezyang/3197/head 2025-12-04T09:43:32.3183458Z * [new branch] gh/ezyang/3197/orig -> origin/gh/ezyang/3197/orig 2025-12-04T09:43:32.3185828Z * [new branch] gh/ezyang/3198/base -> origin/gh/ezyang/3198/base 2025-12-04T09:43:32.3187626Z * [new branch] gh/ezyang/3198/head -> origin/gh/ezyang/3198/head 2025-12-04T09:43:32.3189408Z * [new branch] gh/ezyang/3198/orig -> origin/gh/ezyang/3198/orig 2025-12-04T09:43:32.3191773Z * [new branch] gh/ezyang/3199/base -> origin/gh/ezyang/3199/base 2025-12-04T09:43:32.3193508Z * [new branch] gh/ezyang/3199/head -> origin/gh/ezyang/3199/head 2025-12-04T09:43:32.3195230Z * [new branch] gh/ezyang/3199/orig -> origin/gh/ezyang/3199/orig 
2025-12-04T09:43:32.3197565Z * [new branch] gh/ezyang/3200/base -> origin/gh/ezyang/3200/base 2025-12-04T09:43:32.3199382Z * [new branch] gh/ezyang/3200/head -> origin/gh/ezyang/3200/head 2025-12-04T09:43:32.3201062Z * [new branch] gh/ezyang/3200/orig -> origin/gh/ezyang/3200/orig 2025-12-04T09:43:32.3203674Z * [new branch] gh/ezyang/3201/base -> origin/gh/ezyang/3201/base 2025-12-04T09:43:32.3205504Z * [new branch] gh/ezyang/3201/head -> origin/gh/ezyang/3201/head 2025-12-04T09:43:32.3207091Z * [new branch] gh/ezyang/3201/orig -> origin/gh/ezyang/3201/orig 2025-12-04T09:43:32.3209403Z * [new branch] gh/ezyang/3202/base -> origin/gh/ezyang/3202/base 2025-12-04T09:43:32.3211115Z * [new branch] gh/ezyang/3202/head -> origin/gh/ezyang/3202/head 2025-12-04T09:43:32.3212864Z * [new branch] gh/ezyang/3202/orig -> origin/gh/ezyang/3202/orig 2025-12-04T09:43:32.3215185Z * [new branch] gh/ezyang/3203/base -> origin/gh/ezyang/3203/base 2025-12-04T09:43:32.3216967Z * [new branch] gh/ezyang/3203/head -> origin/gh/ezyang/3203/head 2025-12-04T09:43:32.3218783Z * [new branch] gh/ezyang/3203/orig -> origin/gh/ezyang/3203/orig 2025-12-04T09:43:32.3221248Z * [new branch] gh/ezyang/3204/base -> origin/gh/ezyang/3204/base 2025-12-04T09:43:32.3223033Z * [new branch] gh/ezyang/3204/head -> origin/gh/ezyang/3204/head 2025-12-04T09:43:32.3224739Z * [new branch] gh/ezyang/3204/orig -> origin/gh/ezyang/3204/orig 2025-12-04T09:43:32.3227189Z * [new branch] gh/ezyang/3205/base -> origin/gh/ezyang/3205/base 2025-12-04T09:43:32.3229087Z * [new branch] gh/ezyang/3205/head -> origin/gh/ezyang/3205/head 2025-12-04T09:43:32.3230686Z * [new branch] gh/ezyang/3205/orig -> origin/gh/ezyang/3205/orig 2025-12-04T09:43:32.3233602Z * [new branch] gh/ezyang/3206/base -> origin/gh/ezyang/3206/base 2025-12-04T09:43:32.3235302Z * [new branch] gh/ezyang/3206/head -> origin/gh/ezyang/3206/head 2025-12-04T09:43:32.3237022Z * [new branch] gh/ezyang/3206/orig -> origin/gh/ezyang/3206/orig 
2025-12-04T09:43:32.3239362Z * [new branch] gh/ezyang/3207/base -> origin/gh/ezyang/3207/base 2025-12-04T09:43:32.3241112Z * [new branch] gh/ezyang/3207/head -> origin/gh/ezyang/3207/head 2025-12-04T09:43:32.3242796Z * [new branch] gh/ezyang/3207/orig -> origin/gh/ezyang/3207/orig 2025-12-04T09:43:32.3245169Z * [new branch] gh/ezyang/3208/base -> origin/gh/ezyang/3208/base 2025-12-04T09:43:32.3246982Z * [new branch] gh/ezyang/3208/head -> origin/gh/ezyang/3208/head 2025-12-04T09:43:32.3248733Z * [new branch] gh/ezyang/3208/orig -> origin/gh/ezyang/3208/orig 2025-12-04T09:43:32.3251185Z * [new branch] gh/ezyang/3209/base -> origin/gh/ezyang/3209/base 2025-12-04T09:43:32.3252928Z * [new branch] gh/ezyang/3209/head -> origin/gh/ezyang/3209/head 2025-12-04T09:43:32.3254703Z * [new branch] gh/ezyang/3209/orig -> origin/gh/ezyang/3209/orig 2025-12-04T09:43:32.3257682Z * [new branch] gh/fadara01/3/base -> origin/gh/fadara01/3/base 2025-12-04T09:43:32.3259333Z * [new branch] gh/fadara01/3/head -> origin/gh/fadara01/3/head 2025-12-04T09:43:32.3260994Z * [new branch] gh/fadara01/3/orig -> origin/gh/fadara01/3/orig 2025-12-04T09:43:32.3263310Z * [new branch] gh/fadara01/5/base -> origin/gh/fadara01/5/base 2025-12-04T09:43:32.3265041Z * [new branch] gh/fadara01/5/head -> origin/gh/fadara01/5/head 2025-12-04T09:43:32.3266774Z * [new branch] gh/fadara01/5/orig -> origin/gh/fadara01/5/orig 2025-12-04T09:43:32.3269378Z * [new branch] gh/fadara01/6/base -> origin/gh/fadara01/6/base 2025-12-04T09:43:32.3271038Z * [new branch] gh/fadara01/6/head -> origin/gh/fadara01/6/head 2025-12-04T09:43:32.3272719Z * [new branch] gh/fadara01/6/orig -> origin/gh/fadara01/6/orig 2025-12-04T09:43:32.3275165Z * [new branch] gh/fadara01/7/base -> origin/gh/fadara01/7/base 2025-12-04T09:43:32.3276759Z * [new branch] gh/fadara01/7/head -> origin/gh/fadara01/7/head 2025-12-04T09:43:32.3278587Z * [new branch] gh/fadara01/7/orig -> origin/gh/fadara01/7/orig 2025-12-04T09:43:32.3280949Z * [new branch] 
gh/fadara01/8/base -> origin/gh/fadara01/8/base 2025-12-04T09:43:32.3282723Z * [new branch] gh/fadara01/8/head -> origin/gh/fadara01/8/head 2025-12-04T09:43:32.3284449Z * [new branch] gh/fadara01/8/orig -> origin/gh/fadara01/8/orig 2025-12-04T09:43:32.3286714Z * [new branch] gh/fadara01/9/base -> origin/gh/fadara01/9/base 2025-12-04T09:43:32.3288471Z * [new branch] gh/fadara01/9/head -> origin/gh/fadara01/9/head 2025-12-04T09:43:32.3290198Z * [new branch] gh/fadara01/9/orig -> origin/gh/fadara01/9/orig 2025-12-04T09:43:32.3293132Z * [new branch] gh/fduwjj/182/base -> origin/gh/fduwjj/182/base 2025-12-04T09:43:32.3294856Z * [new branch] gh/fduwjj/182/head -> origin/gh/fduwjj/182/head 2025-12-04T09:43:32.3296550Z * [new branch] gh/fduwjj/182/orig -> origin/gh/fduwjj/182/orig 2025-12-04T09:43:32.3298890Z * [new branch] gh/fduwjj/211/base -> origin/gh/fduwjj/211/base 2025-12-04T09:43:32.3300598Z * [new branch] gh/fduwjj/211/head -> origin/gh/fduwjj/211/head 2025-12-04T09:43:32.3302369Z * [new branch] gh/fduwjj/211/orig -> origin/gh/fduwjj/211/orig 2025-12-04T09:43:32.3304611Z * [new branch] gh/fduwjj/212/base -> origin/gh/fduwjj/212/base 2025-12-04T09:43:32.3306358Z * [new branch] gh/fduwjj/212/head -> origin/gh/fduwjj/212/head 2025-12-04T09:43:32.3308179Z * [new branch] gh/fduwjj/212/orig -> origin/gh/fduwjj/212/orig 2025-12-04T09:43:32.3310501Z * [new branch] gh/fduwjj/213/base -> origin/gh/fduwjj/213/base 2025-12-04T09:43:32.3312233Z * [new branch] gh/fduwjj/213/head -> origin/gh/fduwjj/213/head 2025-12-04T09:43:32.3313931Z * [new branch] gh/fduwjj/213/orig -> origin/gh/fduwjj/213/orig 2025-12-04T09:43:32.3316485Z * [new branch] gh/fduwjj/226/base -> origin/gh/fduwjj/226/base 2025-12-04T09:43:32.3318127Z * [new branch] gh/fduwjj/226/head -> origin/gh/fduwjj/226/head 2025-12-04T09:43:32.3319771Z * [new branch] gh/fduwjj/226/orig -> origin/gh/fduwjj/226/orig 2025-12-04T09:43:32.3322240Z * [new branch] gh/fduwjj/229/base -> origin/gh/fduwjj/229/base 
2025-12-04T09:43:32.3323901Z * [new branch] gh/fduwjj/229/head -> origin/gh/fduwjj/229/head 2025-12-04T09:43:32.3325714Z * [new branch] gh/fduwjj/229/orig -> origin/gh/fduwjj/229/orig 2025-12-04T09:43:32.3328075Z * [new branch] gh/fduwjj/233/base -> origin/gh/fduwjj/233/base 2025-12-04T09:43:32.3329847Z * [new branch] gh/fduwjj/233/head -> origin/gh/fduwjj/233/head 2025-12-04T09:43:32.3331494Z * [new branch] gh/fduwjj/233/orig -> origin/gh/fduwjj/233/orig 2025-12-04T09:43:32.3333900Z * [new branch] gh/fduwjj/234/base -> origin/gh/fduwjj/234/base 2025-12-04T09:43:32.3335658Z * [new branch] gh/fduwjj/234/head -> origin/gh/fduwjj/234/head 2025-12-04T09:43:32.3337389Z * [new branch] gh/fduwjj/234/orig -> origin/gh/fduwjj/234/orig 2025-12-04T09:43:32.3339846Z * [new branch] gh/fduwjj/235/base -> origin/gh/fduwjj/235/base 2025-12-04T09:43:32.3341566Z * [new branch] gh/fduwjj/235/head -> origin/gh/fduwjj/235/head 2025-12-04T09:43:32.3343360Z * [new branch] gh/fduwjj/235/orig -> origin/gh/fduwjj/235/orig 2025-12-04T09:43:32.3345619Z * [new branch] gh/fduwjj/236/base -> origin/gh/fduwjj/236/base 2025-12-04T09:43:32.3347312Z * [new branch] gh/fduwjj/236/head -> origin/gh/fduwjj/236/head 2025-12-04T09:43:32.3349084Z * [new branch] gh/fduwjj/236/orig -> origin/gh/fduwjj/236/orig 2025-12-04T09:43:32.3351236Z * [new branch] gh/fduwjj/237/base -> origin/gh/fduwjj/237/base 2025-12-04T09:43:32.3352991Z * [new branch] gh/fduwjj/237/head -> origin/gh/fduwjj/237/head 2025-12-04T09:43:32.3354707Z * [new branch] gh/fduwjj/237/orig -> origin/gh/fduwjj/237/orig 2025-12-04T09:43:32.3357410Z * [new branch] gh/fduwjj/238/base -> origin/gh/fduwjj/238/base 2025-12-04T09:43:32.3359051Z * [new branch] gh/fduwjj/238/head -> origin/gh/fduwjj/238/head 2025-12-04T09:43:32.3360789Z * [new branch] gh/fduwjj/238/orig -> origin/gh/fduwjj/238/orig 2025-12-04T09:43:32.3363166Z * [new branch] gh/fduwjj/239/base -> origin/gh/fduwjj/239/base 2025-12-04T09:43:32.3364970Z * [new branch] gh/fduwjj/239/head -> 
origin/gh/fduwjj/239/head 2025-12-04T09:43:32.3366689Z * [new branch] gh/fduwjj/239/orig -> origin/gh/fduwjj/239/orig 2025-12-04T09:43:32.3369458Z * [new branch] gh/fegin/332/base -> origin/gh/fegin/332/base 2025-12-04T09:43:32.3371191Z * [new branch] gh/fegin/332/head -> origin/gh/fegin/332/head 2025-12-04T09:43:32.3372992Z * [new branch] gh/fegin/332/orig -> origin/gh/fegin/332/orig 2025-12-04T09:43:32.3375250Z * [new branch] gh/fegin/333/base -> origin/gh/fegin/333/base 2025-12-04T09:43:32.3376940Z * [new branch] gh/fegin/333/head -> origin/gh/fegin/333/head 2025-12-04T09:43:32.3378686Z * [new branch] gh/fegin/333/orig -> origin/gh/fegin/333/orig 2025-12-04T09:43:32.3381352Z * [new branch] gh/fegin/334/base -> origin/gh/fegin/334/base 2025-12-04T09:43:32.3383025Z * [new branch] gh/fegin/334/head -> origin/gh/fegin/334/head 2025-12-04T09:43:32.3384940Z * [new branch] gh/fegin/334/orig -> origin/gh/fegin/334/orig 2025-12-04T09:43:32.3387744Z * [new branch] gh/fegin/335/base -> origin/gh/fegin/335/base 2025-12-04T09:43:32.3389485Z * [new branch] gh/fegin/335/head -> origin/gh/fegin/335/head 2025-12-04T09:43:32.3391138Z * [new branch] gh/fegin/335/orig -> origin/gh/fegin/335/orig 2025-12-04T09:43:32.3393887Z * [new branch] gh/fffrog/160/base -> origin/gh/fffrog/160/base 2025-12-04T09:43:32.3395641Z * [new branch] gh/fffrog/160/head -> origin/gh/fffrog/160/head 2025-12-04T09:43:32.3397989Z * [new branch] gh/fffrog/177/base -> origin/gh/fffrog/177/base 2025-12-04T09:43:32.3399710Z * [new branch] gh/fffrog/177/head -> origin/gh/fffrog/177/head 2025-12-04T09:43:32.3401542Z * [new branch] gh/fffrog/177/orig -> origin/gh/fffrog/177/orig 2025-12-04T09:43:32.3403832Z * [new branch] gh/fffrog/178/base -> origin/gh/fffrog/178/base 2025-12-04T09:43:32.3405534Z * [new branch] gh/fffrog/178/head -> origin/gh/fffrog/178/head 2025-12-04T09:43:32.3407330Z * [new branch] gh/fffrog/178/orig -> origin/gh/fffrog/178/orig 2025-12-04T09:43:32.3409777Z * [new branch] gh/fffrog/181/base -> 
origin/gh/fffrog/181/base 2025-12-04T09:43:32.3411541Z * [new branch] gh/fffrog/181/head -> origin/gh/fffrog/181/head 2025-12-04T09:43:32.3413248Z * [new branch] gh/fffrog/181/orig -> origin/gh/fffrog/181/orig 2025-12-04T09:43:32.3415631Z * [new branch] gh/fffrog/183/base -> origin/gh/fffrog/183/base 2025-12-04T09:43:32.3417240Z * [new branch] gh/fffrog/183/head -> origin/gh/fffrog/183/head 2025-12-04T09:43:32.3418919Z * [new branch] gh/fffrog/183/orig -> origin/gh/fffrog/183/orig 2025-12-04T09:43:32.3421986Z * [new branch] gh/fxdawnn/10/base -> origin/gh/fxdawnn/10/base 2025-12-04T09:43:32.3423550Z * [new branch] gh/fxdawnn/10/head -> origin/gh/fxdawnn/10/head 2025-12-04T09:43:32.3425377Z * [new branch] gh/fxdawnn/10/orig -> origin/gh/fxdawnn/10/orig 2025-12-04T09:43:32.3427762Z * [new branch] gh/fxdawnn/11/base -> origin/gh/fxdawnn/11/base 2025-12-04T09:43:32.3429702Z * [new branch] gh/fxdawnn/11/head -> origin/gh/fxdawnn/11/head 2025-12-04T09:43:32.3431335Z * [new branch] gh/fxdawnn/11/orig -> origin/gh/fxdawnn/11/orig 2025-12-04T09:43:32.3433660Z * [new branch] gh/fxdawnn/12/base -> origin/gh/fxdawnn/12/base 2025-12-04T09:43:32.3435426Z * [new branch] gh/fxdawnn/12/head -> origin/gh/fxdawnn/12/head 2025-12-04T09:43:32.3437155Z * [new branch] gh/fxdawnn/12/orig -> origin/gh/fxdawnn/12/orig 2025-12-04T09:43:32.3439536Z * [new branch] gh/fxdawnn/13/base -> origin/gh/fxdawnn/13/base 2025-12-04T09:43:32.3441237Z * [new branch] gh/fxdawnn/13/head -> origin/gh/fxdawnn/13/head 2025-12-04T09:43:32.3442979Z * [new branch] gh/fxdawnn/13/orig -> origin/gh/fxdawnn/13/orig 2025-12-04T09:43:32.3445393Z * [new branch] gh/fxdawnn/14/base -> origin/gh/fxdawnn/14/base 2025-12-04T09:43:32.3447048Z * [new branch] gh/fxdawnn/14/head -> origin/gh/fxdawnn/14/head 2025-12-04T09:43:32.3448743Z * [new branch] gh/fxdawnn/14/orig -> origin/gh/fxdawnn/14/orig 2025-12-04T09:43:32.3451176Z * [new branch] gh/fxdawnn/15/base -> origin/gh/fxdawnn/15/base 2025-12-04T09:43:32.3452988Z * [new 
branch] gh/fxdawnn/15/head -> origin/gh/fxdawnn/15/head 2025-12-04T09:43:32.3454666Z * [new branch] gh/fxdawnn/15/orig -> origin/gh/fxdawnn/15/orig 2025-12-04T09:43:32.3458345Z * [new branch] gh/fxdawnn/6/base -> origin/gh/fxdawnn/6/base 2025-12-04T09:43:32.3460022Z * [new branch] gh/fxdawnn/6/head -> origin/gh/fxdawnn/6/head 2025-12-04T09:43:32.3461718Z * [new branch] gh/fxdawnn/6/orig -> origin/gh/fxdawnn/6/orig 2025-12-04T09:43:32.3464125Z * [new branch] gh/fxdawnn/7/base -> origin/gh/fxdawnn/7/base 2025-12-04T09:43:32.3466289Z * [new branch] gh/fxdawnn/7/head -> origin/gh/fxdawnn/7/head 2025-12-04T09:43:32.3467705Z * [new branch] gh/fxdawnn/7/orig -> origin/gh/fxdawnn/7/orig 2025-12-04T09:43:32.3470163Z * [new branch] gh/fxdawnn/9/base -> origin/gh/fxdawnn/9/base 2025-12-04T09:43:32.3471789Z * [new branch] gh/fxdawnn/9/head -> origin/gh/fxdawnn/9/head 2025-12-04T09:43:32.3473499Z * [new branch] gh/fxdawnn/9/orig -> origin/gh/fxdawnn/9/orig 2025-12-04T09:43:32.3476357Z * [new branch] gh/galv/1/base -> origin/gh/galv/1/base 2025-12-04T09:43:32.3478045Z * [new branch] gh/galv/1/head -> origin/gh/galv/1/head 2025-12-04T09:43:32.3480038Z * [new branch] gh/galv/1/orig -> origin/gh/galv/1/orig 2025-12-04T09:43:32.3482371Z * [new branch] gh/galv/2/base -> origin/gh/galv/2/base 2025-12-04T09:43:32.3484115Z * [new branch] gh/galv/2/head -> origin/gh/galv/2/head 2025-12-04T09:43:32.3485913Z * [new branch] gh/galv/2/orig -> origin/gh/galv/2/orig 2025-12-04T09:43:32.3488472Z * [new branch] gh/galv/3/base -> origin/gh/galv/3/base 2025-12-04T09:43:32.3489997Z * [new branch] gh/galv/3/head -> origin/gh/galv/3/head 2025-12-04T09:43:32.3491783Z * [new branch] gh/galv/3/orig -> origin/gh/galv/3/orig 2025-12-04T09:43:32.3494635Z * [new branch] gh/guangyey/134/base -> origin/gh/guangyey/134/base 2025-12-04T09:43:32.3496302Z * [new branch] gh/guangyey/134/head -> origin/gh/guangyey/134/head 2025-12-04T09:43:32.3498087Z * [new branch] gh/guangyey/134/orig -> 
origin/gh/guangyey/134/orig 2025-12-04T09:43:32.3500479Z * [new branch] gh/guangyey/163/base -> origin/gh/guangyey/163/base 2025-12-04T09:43:32.3502200Z * [new branch] gh/guangyey/163/head -> origin/gh/guangyey/163/head 2025-12-04T09:43:32.3503910Z * [new branch] gh/guangyey/163/orig -> origin/gh/guangyey/163/orig 2025-12-04T09:43:32.3506219Z * [new branch] gh/guangyey/168/base -> origin/gh/guangyey/168/base 2025-12-04T09:43:32.3508051Z * [new branch] gh/guangyey/168/head -> origin/gh/guangyey/168/head 2025-12-04T09:43:32.3509749Z * [new branch] gh/guangyey/168/orig -> origin/gh/guangyey/168/orig 2025-12-04T09:43:32.3512062Z * [new branch] gh/guangyey/169/base -> origin/gh/guangyey/169/base 2025-12-04T09:43:32.3513768Z * [new branch] gh/guangyey/169/head -> origin/gh/guangyey/169/head 2025-12-04T09:43:32.3515499Z * [new branch] gh/guangyey/169/orig -> origin/gh/guangyey/169/orig 2025-12-04T09:43:32.3517811Z * [new branch] gh/guangyey/170/base -> origin/gh/guangyey/170/base 2025-12-04T09:43:32.3519512Z * [new branch] gh/guangyey/170/head -> origin/gh/guangyey/170/head 2025-12-04T09:43:32.3521272Z * [new branch] gh/guangyey/170/orig -> origin/gh/guangyey/170/orig 2025-12-04T09:43:32.3523707Z * [new branch] gh/guangyey/171/base -> origin/gh/guangyey/171/base 2025-12-04T09:43:32.3525511Z * [new branch] gh/guangyey/171/head -> origin/gh/guangyey/171/head 2025-12-04T09:43:32.3527326Z * [new branch] gh/guangyey/171/orig -> origin/gh/guangyey/171/orig 2025-12-04T09:43:32.3529607Z * [new branch] gh/guangyey/178/base -> origin/gh/guangyey/178/base 2025-12-04T09:43:32.3531351Z * [new branch] gh/guangyey/178/head -> origin/gh/guangyey/178/head 2025-12-04T09:43:32.3533095Z * [new branch] gh/guangyey/178/orig -> origin/gh/guangyey/178/orig 2025-12-04T09:43:32.3535368Z * [new branch] gh/guangyey/182/base -> origin/gh/guangyey/182/base 2025-12-04T09:43:32.3537080Z * [new branch] gh/guangyey/182/head -> origin/gh/guangyey/182/head 2025-12-04T09:43:32.3538796Z * [new branch] 
gh/guangyey/182/orig -> origin/gh/guangyey/182/orig 2025-12-04T09:43:32.3541046Z * [new branch] gh/guangyey/183/base -> origin/gh/guangyey/183/base 2025-12-04T09:43:32.3542759Z * [new branch] gh/guangyey/183/head -> origin/gh/guangyey/183/head 2025-12-04T09:43:32.3544530Z * [new branch] gh/guangyey/183/orig -> origin/gh/guangyey/183/orig 2025-12-04T09:43:32.3546908Z * [new branch] gh/guangyey/185/base -> origin/gh/guangyey/185/base 2025-12-04T09:43:32.3548804Z * [new branch] gh/guangyey/185/head -> origin/gh/guangyey/185/head 2025-12-04T09:43:32.3550493Z * [new branch] gh/guangyey/185/orig -> origin/gh/guangyey/185/orig 2025-12-04T09:43:32.3552901Z * [new branch] gh/guangyey/186/base -> origin/gh/guangyey/186/base 2025-12-04T09:43:32.3554659Z * [new branch] gh/guangyey/186/head -> origin/gh/guangyey/186/head 2025-12-04T09:43:32.3556679Z * [new branch] gh/guangyey/186/orig -> origin/gh/guangyey/186/orig 2025-12-04T09:43:32.3558856Z * [new branch] gh/guangyey/187/base -> origin/gh/guangyey/187/base 2025-12-04T09:43:32.3560520Z * [new branch] gh/guangyey/187/head -> origin/gh/guangyey/187/head 2025-12-04T09:43:32.3562237Z * [new branch] gh/guangyey/187/orig -> origin/gh/guangyey/187/orig 2025-12-04T09:43:32.3564540Z * [new branch] gh/guangyey/188/base -> origin/gh/guangyey/188/base 2025-12-04T09:43:32.3566307Z * [new branch] gh/guangyey/188/head -> origin/gh/guangyey/188/head 2025-12-04T09:43:32.3568065Z * [new branch] gh/guangyey/188/orig -> origin/gh/guangyey/188/orig 2025-12-04T09:43:32.3570539Z * [new branch] gh/guangyey/190/base -> origin/gh/guangyey/190/base 2025-12-04T09:43:32.3572231Z * [new branch] gh/guangyey/190/head -> origin/gh/guangyey/190/head 2025-12-04T09:43:32.3573928Z * [new branch] gh/guangyey/190/orig -> origin/gh/guangyey/190/orig 2025-12-04T09:43:32.3576175Z * [new branch] gh/guangyey/208/base -> origin/gh/guangyey/208/base 2025-12-04T09:43:32.3577881Z * [new branch] gh/guangyey/208/head -> origin/gh/guangyey/208/head 
2025-12-04T09:43:32.3579574Z * [new branch] gh/guangyey/208/orig -> origin/gh/guangyey/208/orig 2025-12-04T09:43:32.3581882Z * [new branch] gh/guangyey/228/base -> origin/gh/guangyey/228/base 2025-12-04T09:43:32.3583528Z * [new branch] gh/guangyey/228/head -> origin/gh/guangyey/228/head 2025-12-04T09:43:32.3585874Z * [new branch] gh/guangyey/228/orig -> origin/gh/guangyey/228/orig 2025-12-04T09:43:32.3588927Z * [new branch] gh/guangyey/230/base -> origin/gh/guangyey/230/base 2025-12-04T09:43:32.3590724Z * [new branch] gh/guangyey/230/head -> origin/gh/guangyey/230/head 2025-12-04T09:43:32.3592397Z * [new branch] gh/guangyey/230/orig -> origin/gh/guangyey/230/orig 2025-12-04T09:43:32.3594781Z * [new branch] gh/guangyey/231/base -> origin/gh/guangyey/231/base 2025-12-04T09:43:32.3596538Z * [new branch] gh/guangyey/231/head -> origin/gh/guangyey/231/head 2025-12-04T09:43:32.3598228Z * [new branch] gh/guangyey/231/orig -> origin/gh/guangyey/231/orig 2025-12-04T09:43:32.3600645Z * [new branch] gh/guangyey/232/base -> origin/gh/guangyey/232/base 2025-12-04T09:43:32.3602317Z * [new branch] gh/guangyey/232/head -> origin/gh/guangyey/232/head 2025-12-04T09:43:32.3604034Z * [new branch] gh/guangyey/232/orig -> origin/gh/guangyey/232/orig 2025-12-04T09:43:32.3606414Z * [new branch] gh/guangyey/233/base -> origin/gh/guangyey/233/base 2025-12-04T09:43:32.3608112Z * [new branch] gh/guangyey/233/head -> origin/gh/guangyey/233/head 2025-12-04T09:43:32.3609923Z * [new branch] gh/guangyey/233/orig -> origin/gh/guangyey/233/orig 2025-12-04T09:43:32.3612196Z * [new branch] gh/guangyey/234/base -> origin/gh/guangyey/234/base 2025-12-04T09:43:32.3614036Z * [new branch] gh/guangyey/234/head -> origin/gh/guangyey/234/head 2025-12-04T09:43:32.3615799Z * [new branch] gh/guangyey/234/orig -> origin/gh/guangyey/234/orig 2025-12-04T09:43:32.3618239Z * [new branch] gh/guangyey/235/base -> origin/gh/guangyey/235/base 2025-12-04T09:43:32.3619911Z * [new branch] gh/guangyey/235/head -> 
origin/gh/guangyey/235/head 2025-12-04T09:43:32.3621636Z * [new branch] gh/guangyey/235/orig -> origin/gh/guangyey/235/orig 2025-12-04T09:43:32.3624472Z * [new branch] gh/guangyey/236/base -> origin/gh/guangyey/236/base 2025-12-04T09:43:32.3626410Z * [new branch] gh/guangyey/236/head -> origin/gh/guangyey/236/head 2025-12-04T09:43:32.3628192Z * [new branch] gh/guangyey/236/orig -> origin/gh/guangyey/236/orig 2025-12-04T09:43:32.3630592Z * [new branch] gh/guangyey/237/base -> origin/gh/guangyey/237/base 2025-12-04T09:43:32.3632314Z * [new branch] gh/guangyey/237/head -> origin/gh/guangyey/237/head 2025-12-04T09:43:32.3634033Z * [new branch] gh/guangyey/237/orig -> origin/gh/guangyey/237/orig 2025-12-04T09:43:32.3636478Z * [new branch] gh/guangyey/238/base -> origin/gh/guangyey/238/base 2025-12-04T09:43:32.3638190Z * [new branch] gh/guangyey/238/head -> origin/gh/guangyey/238/head 2025-12-04T09:43:32.3640788Z * [new branch] gh/guangyey/239/base -> origin/gh/guangyey/239/base 2025-12-04T09:43:32.3642516Z * [new branch] gh/guangyey/239/head -> origin/gh/guangyey/239/head 2025-12-04T09:43:32.3644291Z * [new branch] gh/guangyey/239/orig -> origin/gh/guangyey/239/orig 2025-12-04T09:43:32.3646640Z * [new branch] gh/guangyey/240/base -> origin/gh/guangyey/240/base 2025-12-04T09:43:32.3648408Z * [new branch] gh/guangyey/240/head -> origin/gh/guangyey/240/head 2025-12-04T09:43:32.3650067Z * [new branch] gh/guangyey/240/orig -> origin/gh/guangyey/240/orig 2025-12-04T09:43:32.3652419Z * [new branch] gh/guangyey/241/base -> origin/gh/guangyey/241/base 2025-12-04T09:43:32.3654125Z * [new branch] gh/guangyey/241/head -> origin/gh/guangyey/241/head 2025-12-04T09:43:32.3656120Z * [new branch] gh/guangyey/241/orig -> origin/gh/guangyey/241/orig 2025-12-04T09:43:32.3658511Z * [new branch] gh/guangyey/242/base -> origin/gh/guangyey/242/base 2025-12-04T09:43:32.3660172Z * [new branch] gh/guangyey/242/head -> origin/gh/guangyey/242/head 2025-12-04T09:43:32.3661932Z * [new branch] 
gh/guangyey/242/orig -> origin/gh/guangyey/242/orig 2025-12-04T09:43:32.3664378Z * [new branch] gh/guangyey/243/base -> origin/gh/guangyey/243/base 2025-12-04T09:43:32.3666083Z * [new branch] gh/guangyey/243/head -> origin/gh/guangyey/243/head 2025-12-04T09:43:32.3667907Z * [new branch] gh/guangyey/243/orig -> origin/gh/guangyey/243/orig 2025-12-04T09:43:32.3670273Z * [new branch] gh/guangyey/244/base -> origin/gh/guangyey/244/base 2025-12-04T09:43:32.3671957Z * [new branch] gh/guangyey/244/head -> origin/gh/guangyey/244/head 2025-12-04T09:43:32.3673675Z * [new branch] gh/guangyey/244/orig -> origin/gh/guangyey/244/orig 2025-12-04T09:43:32.3676114Z * [new branch] gh/guangyey/245/base -> origin/gh/guangyey/245/base 2025-12-04T09:43:32.3677801Z * [new branch] gh/guangyey/245/head -> origin/gh/guangyey/245/head 2025-12-04T09:43:32.3679573Z * [new branch] gh/guangyey/245/orig -> origin/gh/guangyey/245/orig 2025-12-04T09:43:32.3681967Z * [new branch] gh/guangyey/246/base -> origin/gh/guangyey/246/base 2025-12-04T09:43:32.3683670Z * [new branch] gh/guangyey/246/head -> origin/gh/guangyey/246/head 2025-12-04T09:43:32.3685404Z * [new branch] gh/guangyey/246/orig -> origin/gh/guangyey/246/orig 2025-12-04T09:43:32.3687831Z * [new branch] gh/guangyey/247/base -> origin/gh/guangyey/247/base 2025-12-04T09:43:32.3689592Z * [new branch] gh/guangyey/247/head -> origin/gh/guangyey/247/head 2025-12-04T09:43:32.3691343Z * [new branch] gh/guangyey/247/orig -> origin/gh/guangyey/247/orig 2025-12-04T09:43:32.3693688Z * [new branch] gh/guangyey/248/base -> origin/gh/guangyey/248/base 2025-12-04T09:43:32.3695527Z * [new branch] gh/guangyey/248/head -> origin/gh/guangyey/248/head 2025-12-04T09:43:32.3697122Z * [new branch] gh/guangyey/248/orig -> origin/gh/guangyey/248/orig 2025-12-04T09:43:32.3699436Z * [new branch] gh/guangyey/249/base -> origin/gh/guangyey/249/base 2025-12-04T09:43:32.3701235Z * [new branch] gh/guangyey/249/head -> origin/gh/guangyey/249/head 
2025-12-04T09:43:32.3702985Z * [new branch] gh/guangyey/249/orig -> origin/gh/guangyey/249/orig 2025-12-04T09:43:32.3705459Z * [new branch] gh/guangyey/250/base -> origin/gh/guangyey/250/base 2025-12-04T09:43:32.3707162Z * [new branch] gh/guangyey/250/head -> origin/gh/guangyey/250/head 2025-12-04T09:43:32.3709102Z * [new branch] gh/guangyey/250/orig -> origin/gh/guangyey/250/orig 2025-12-04T09:43:32.3711459Z * [new branch] gh/guangyey/251/base -> origin/gh/guangyey/251/base 2025-12-04T09:43:32.3713156Z * [new branch] gh/guangyey/251/head -> origin/gh/guangyey/251/head 2025-12-04T09:43:32.3714900Z * [new branch] gh/guangyey/251/orig -> origin/gh/guangyey/251/orig 2025-12-04T09:43:32.3717282Z * [new branch] gh/guangyey/252/base -> origin/gh/guangyey/252/base 2025-12-04T09:43:32.3719046Z * [new branch] gh/guangyey/252/head -> origin/gh/guangyey/252/head 2025-12-04T09:43:32.3720672Z * [new branch] gh/guangyey/252/orig -> origin/gh/guangyey/252/orig 2025-12-04T09:43:32.3722960Z * [new branch] gh/guangyey/253/base -> origin/gh/guangyey/253/base 2025-12-04T09:43:32.3724814Z * [new branch] gh/guangyey/253/head -> origin/gh/guangyey/253/head 2025-12-04T09:43:32.3726395Z * [new branch] gh/guangyey/253/orig -> origin/gh/guangyey/253/orig 2025-12-04T09:43:32.3728751Z * [new branch] gh/guangyey/254/base -> origin/gh/guangyey/254/base 2025-12-04T09:43:32.3730492Z * [new branch] gh/guangyey/254/head -> origin/gh/guangyey/254/head 2025-12-04T09:43:32.3732194Z * [new branch] gh/guangyey/254/orig -> origin/gh/guangyey/254/orig 2025-12-04T09:43:32.3734592Z * [new branch] gh/guangyey/255/base -> origin/gh/guangyey/255/base 2025-12-04T09:43:32.3736369Z * [new branch] gh/guangyey/255/head -> origin/gh/guangyey/255/head 2025-12-04T09:43:32.3738416Z * [new branch] gh/guangyey/255/orig -> origin/gh/guangyey/255/orig 2025-12-04T09:43:32.3742154Z * [new branch] gh/guilhermeleobas/107/base -> origin/gh/guilhermeleobas/107/base 2025-12-04T09:43:32.3743784Z * [new branch] 
gh/guilhermeleobas/107/head -> origin/gh/guilhermeleobas/107/head 2025-12-04T09:43:32.3745486Z * [new branch] gh/guilhermeleobas/107/orig -> origin/gh/guilhermeleobas/107/orig 2025-12-04T09:43:32.3748040Z * [new branch] gh/guilhermeleobas/108/base -> origin/gh/guilhermeleobas/108/base 2025-12-04T09:43:32.3749685Z * [new branch] gh/guilhermeleobas/108/head -> origin/gh/guilhermeleobas/108/head 2025-12-04T09:43:32.3751296Z * [new branch] gh/guilhermeleobas/108/orig -> origin/gh/guilhermeleobas/108/orig 2025-12-04T09:43:32.3753586Z * [new branch] gh/guilhermeleobas/150/base -> origin/gh/guilhermeleobas/150/base 2025-12-04T09:43:32.3759007Z * [new branch] gh/guilhermeleobas/150/head -> origin/gh/guilhermeleobas/150/head 2025-12-04T09:43:32.3759747Z * [new branch] gh/guilhermeleobas/150/orig -> origin/gh/guilhermeleobas/150/orig 2025-12-04T09:43:32.3762328Z * [new branch] gh/guilhermeleobas/168/base -> origin/gh/guilhermeleobas/168/base 2025-12-04T09:43:32.3764647Z * [new branch] gh/guilhermeleobas/168/head -> origin/gh/guilhermeleobas/168/head 2025-12-04T09:43:32.3766281Z * [new branch] gh/guilhermeleobas/168/orig -> origin/gh/guilhermeleobas/168/orig 2025-12-04T09:43:32.3768711Z * [new branch] gh/guilhermeleobas/169/base -> origin/gh/guilhermeleobas/169/base 2025-12-04T09:43:32.3770288Z * [new branch] gh/guilhermeleobas/169/head -> origin/gh/guilhermeleobas/169/head 2025-12-04T09:43:32.3771935Z * [new branch] gh/guilhermeleobas/169/orig -> origin/gh/guilhermeleobas/169/orig 2025-12-04T09:43:32.3774290Z * [new branch] gh/guilhermeleobas/170/base -> origin/gh/guilhermeleobas/170/base 2025-12-04T09:43:32.3776230Z * [new branch] gh/guilhermeleobas/170/head -> origin/gh/guilhermeleobas/170/head 2025-12-04T09:43:32.3777861Z * [new branch] gh/guilhermeleobas/170/orig -> origin/gh/guilhermeleobas/170/orig 2025-12-04T09:43:32.3780333Z * [new branch] gh/guilhermeleobas/171/base -> origin/gh/guilhermeleobas/171/base 2025-12-04T09:43:32.3782024Z * [new branch] 
gh/guilhermeleobas/171/head -> origin/gh/guilhermeleobas/171/head 2025-12-04T09:43:32.3783724Z * [new branch] gh/guilhermeleobas/171/orig -> origin/gh/guilhermeleobas/171/orig 2025-12-04T09:43:32.3786120Z * [new branch] gh/guilhermeleobas/173/base -> origin/gh/guilhermeleobas/173/base 2025-12-04T09:43:32.3787942Z * [new branch] gh/guilhermeleobas/173/head -> origin/gh/guilhermeleobas/173/head 2025-12-04T09:43:32.3789702Z * [new branch] gh/guilhermeleobas/173/orig -> origin/gh/guilhermeleobas/173/orig 2025-12-04T09:43:32.3791987Z * [new branch] gh/guilhermeleobas/193/base -> origin/gh/guilhermeleobas/193/base 2025-12-04T09:43:32.3793763Z * [new branch] gh/guilhermeleobas/193/head -> origin/gh/guilhermeleobas/193/head 2025-12-04T09:43:32.3795558Z * [new branch] gh/guilhermeleobas/193/orig -> origin/gh/guilhermeleobas/193/orig 2025-12-04T09:43:32.3797884Z * [new branch] gh/guilhermeleobas/204/base -> origin/gh/guilhermeleobas/204/base 2025-12-04T09:43:32.3799557Z * [new branch] gh/guilhermeleobas/204/head -> origin/gh/guilhermeleobas/204/head 2025-12-04T09:43:32.3801257Z * [new branch] gh/guilhermeleobas/204/orig -> origin/gh/guilhermeleobas/204/orig 2025-12-04T09:43:32.3803602Z * [new branch] gh/guilhermeleobas/211/base -> origin/gh/guilhermeleobas/211/base 2025-12-04T09:43:32.3805286Z * [new branch] gh/guilhermeleobas/211/head -> origin/gh/guilhermeleobas/211/head 2025-12-04T09:43:32.3806994Z * [new branch] gh/guilhermeleobas/211/orig -> origin/gh/guilhermeleobas/211/orig 2025-12-04T09:43:32.3809370Z * [new branch] gh/guilhermeleobas/226/base -> origin/gh/guilhermeleobas/226/base 2025-12-04T09:43:32.3811111Z * [new branch] gh/guilhermeleobas/226/head -> origin/gh/guilhermeleobas/226/head 2025-12-04T09:43:32.3812796Z * [new branch] gh/guilhermeleobas/226/orig -> origin/gh/guilhermeleobas/226/orig 2025-12-04T09:43:32.3815121Z * [new branch] gh/guilhermeleobas/236/base -> origin/gh/guilhermeleobas/236/base 2025-12-04T09:43:32.3816835Z * [new branch] 
gh/guilhermeleobas/236/head -> origin/gh/guilhermeleobas/236/head 2025-12-04T09:43:32.3818710Z * [new branch] gh/guilhermeleobas/236/orig -> origin/gh/guilhermeleobas/236/orig 2025-12-04T09:43:32.3821101Z * [new branch] gh/guilhermeleobas/247/base -> origin/gh/guilhermeleobas/247/base 2025-12-04T09:43:32.3822819Z * [new branch] gh/guilhermeleobas/247/head -> origin/gh/guilhermeleobas/247/head 2025-12-04T09:43:32.3824572Z * [new branch] gh/guilhermeleobas/247/orig -> origin/gh/guilhermeleobas/247/orig 2025-12-04T09:43:32.3826869Z * [new branch] gh/guilhermeleobas/248/base -> origin/gh/guilhermeleobas/248/base 2025-12-04T09:43:32.3828682Z * [new branch] gh/guilhermeleobas/248/head -> origin/gh/guilhermeleobas/248/head 2025-12-04T09:43:32.3831355Z * [new branch] gh/guilhermeleobas/248/orig -> origin/gh/guilhermeleobas/248/orig 2025-12-04T09:43:32.3833133Z * [new branch] gh/guilhermeleobas/250/base -> origin/gh/guilhermeleobas/250/base 2025-12-04T09:43:32.3834428Z * [new branch] gh/guilhermeleobas/250/head -> origin/gh/guilhermeleobas/250/head 2025-12-04T09:43:32.3836196Z * [new branch] gh/guilhermeleobas/250/orig -> origin/gh/guilhermeleobas/250/orig 2025-12-04T09:43:32.3839035Z * [new branch] gh/guilhermeleobas/253/base -> origin/gh/guilhermeleobas/253/base 2025-12-04T09:43:32.3840803Z * [new branch] gh/guilhermeleobas/253/head -> origin/gh/guilhermeleobas/253/head 2025-12-04T09:43:32.3842587Z * [new branch] gh/guilhermeleobas/253/orig -> origin/gh/guilhermeleobas/253/orig 2025-12-04T09:43:32.3844952Z * [new branch] gh/guilhermeleobas/254/base -> origin/gh/guilhermeleobas/254/base 2025-12-04T09:43:32.3846683Z * [new branch] gh/guilhermeleobas/254/head -> origin/gh/guilhermeleobas/254/head 2025-12-04T09:43:32.3848428Z * [new branch] gh/guilhermeleobas/254/orig -> origin/gh/guilhermeleobas/254/orig 2025-12-04T09:43:32.3850775Z * [new branch] gh/guilhermeleobas/255/base -> origin/gh/guilhermeleobas/255/base 2025-12-04T09:43:32.3852639Z * [new branch] 
gh/guilhermeleobas/255/head -> origin/gh/guilhermeleobas/255/head 2025-12-04T09:43:32.3854213Z * [new branch] gh/guilhermeleobas/255/orig -> origin/gh/guilhermeleobas/255/orig 2025-12-04T09:43:32.3857190Z * [new branch] gh/guilhermeleobas/256/base -> origin/gh/guilhermeleobas/256/base 2025-12-04T09:43:32.3858715Z * [new branch] gh/guilhermeleobas/256/head -> origin/gh/guilhermeleobas/256/head 2025-12-04T09:43:32.3860449Z * [new branch] gh/guilhermeleobas/256/orig -> origin/gh/guilhermeleobas/256/orig 2025-12-04T09:43:32.3862889Z * [new branch] gh/guilhermeleobas/257/base -> origin/gh/guilhermeleobas/257/base 2025-12-04T09:43:32.3864610Z * [new branch] gh/guilhermeleobas/257/head -> origin/gh/guilhermeleobas/257/head 2025-12-04T09:43:32.3866435Z * [new branch] gh/guilhermeleobas/257/orig -> origin/gh/guilhermeleobas/257/orig 2025-12-04T09:43:32.3868949Z * [new branch] gh/guilhermeleobas/258/base -> origin/gh/guilhermeleobas/258/base 2025-12-04T09:43:32.3870637Z * [new branch] gh/guilhermeleobas/258/head -> origin/gh/guilhermeleobas/258/head 2025-12-04T09:43:32.3872388Z * [new branch] gh/guilhermeleobas/258/orig -> origin/gh/guilhermeleobas/258/orig 2025-12-04T09:43:32.3874773Z * [new branch] gh/guilhermeleobas/259/base -> origin/gh/guilhermeleobas/259/base 2025-12-04T09:43:32.3876512Z * [new branch] gh/guilhermeleobas/259/head -> origin/gh/guilhermeleobas/259/head 2025-12-04T09:43:32.3878204Z * [new branch] gh/guilhermeleobas/259/orig -> origin/gh/guilhermeleobas/259/orig 2025-12-04T09:43:32.3880746Z * [new branch] gh/guilhermeleobas/260/base -> origin/gh/guilhermeleobas/260/base 2025-12-04T09:43:32.3882517Z * [new branch] gh/guilhermeleobas/260/head -> origin/gh/guilhermeleobas/260/head 2025-12-04T09:43:32.3884227Z * [new branch] gh/guilhermeleobas/260/orig -> origin/gh/guilhermeleobas/260/orig 2025-12-04T09:43:32.3886593Z * [new branch] gh/guilhermeleobas/261/base -> origin/gh/guilhermeleobas/261/base 2025-12-04T09:43:32.3888298Z * [new branch] 
gh/guilhermeleobas/261/head -> origin/gh/guilhermeleobas/261/head 2025-12-04T09:43:32.3889983Z * [new branch] gh/guilhermeleobas/261/orig -> origin/gh/guilhermeleobas/261/orig 2025-12-04T09:43:32.3892301Z * [new branch] gh/guilhermeleobas/262/base -> origin/gh/guilhermeleobas/262/base 2025-12-04T09:43:32.3894434Z * [new branch] gh/guilhermeleobas/262/head -> origin/gh/guilhermeleobas/262/head 2025-12-04T09:43:32.3895790Z * [new branch] gh/guilhermeleobas/262/orig -> origin/gh/guilhermeleobas/262/orig 2025-12-04T09:43:32.3898353Z * [new branch] gh/guilhermeleobas/263/base -> origin/gh/guilhermeleobas/263/base 2025-12-04T09:43:32.3900131Z * [new branch] gh/guilhermeleobas/263/head -> origin/gh/guilhermeleobas/263/head 2025-12-04T09:43:32.3901877Z * [new branch] gh/guilhermeleobas/263/orig -> origin/gh/guilhermeleobas/263/orig 2025-12-04T09:43:32.3904323Z * [new branch] gh/guilhermeleobas/264/base -> origin/gh/guilhermeleobas/264/base 2025-12-04T09:43:32.3906104Z * [new branch] gh/guilhermeleobas/264/head -> origin/gh/guilhermeleobas/264/head 2025-12-04T09:43:32.3907861Z * [new branch] gh/guilhermeleobas/264/orig -> origin/gh/guilhermeleobas/264/orig 2025-12-04T09:43:32.3910186Z * [new branch] gh/guilhermeleobas/265/base -> origin/gh/guilhermeleobas/265/base 2025-12-04T09:43:32.3911864Z * [new branch] gh/guilhermeleobas/265/head -> origin/gh/guilhermeleobas/265/head 2025-12-04T09:43:32.3913656Z * [new branch] gh/guilhermeleobas/265/orig -> origin/gh/guilhermeleobas/265/orig 2025-12-04T09:43:32.3916017Z * [new branch] gh/guilhermeleobas/266/base -> origin/gh/guilhermeleobas/266/base 2025-12-04T09:43:32.3917691Z * [new branch] gh/guilhermeleobas/266/head -> origin/gh/guilhermeleobas/266/head 2025-12-04T09:43:32.3919379Z * [new branch] gh/guilhermeleobas/266/orig -> origin/gh/guilhermeleobas/266/orig 2025-12-04T09:43:32.3921803Z * [new branch] gh/guilhermeleobas/267/base -> origin/gh/guilhermeleobas/267/base 2025-12-04T09:43:32.3924111Z * [new branch] 
gh/guilhermeleobas/267/head -> origin/gh/guilhermeleobas/267/head 2025-12-04T09:43:32.3925885Z * [new branch] gh/guilhermeleobas/267/orig -> origin/gh/guilhermeleobas/267/orig 2025-12-04T09:43:32.3928793Z * [new branch] gh/hameerabbasi/1/base -> origin/gh/hameerabbasi/1/base 2025-12-04T09:43:32.3930548Z * [new branch] gh/hameerabbasi/1/head -> origin/gh/hameerabbasi/1/head 2025-12-04T09:43:32.3932771Z * [new branch] gh/hameerabbasi/2/base -> origin/gh/hameerabbasi/2/base 2025-12-04T09:43:32.3934486Z * [new branch] gh/hameerabbasi/2/head -> origin/gh/hameerabbasi/2/head 2025-12-04T09:43:32.3936207Z * [new branch] gh/hameerabbasi/2/orig -> origin/gh/hameerabbasi/2/orig 2025-12-04T09:43:32.3938461Z * [new branch] gh/hameerabbasi/3/base -> origin/gh/hameerabbasi/3/base 2025-12-04T09:43:32.3940266Z * [new branch] gh/hameerabbasi/3/head -> origin/gh/hameerabbasi/3/head 2025-12-04T09:43:32.3942115Z * [new branch] gh/hameerabbasi/3/orig -> origin/gh/hameerabbasi/3/orig 2025-12-04T09:43:32.3944345Z * [new branch] gh/hameerabbasi/4/base -> origin/gh/hameerabbasi/4/base 2025-12-04T09:43:32.3946089Z * [new branch] gh/hameerabbasi/4/head -> origin/gh/hameerabbasi/4/head 2025-12-04T09:43:32.3947826Z * [new branch] gh/hameerabbasi/4/orig -> origin/gh/hameerabbasi/4/orig 2025-12-04T09:43:32.3950668Z * [new branch] gh/huydhn/1/next -> origin/gh/huydhn/1/next 2025-12-04T09:43:32.3953028Z * [new branch] gh/huydhn/2/next -> origin/gh/huydhn/2/next 2025-12-04T09:43:32.3955180Z * [new branch] gh/huydhn/3/next -> origin/gh/huydhn/3/next 2025-12-04T09:43:32.3957762Z * [new branch] gh/huydhn/4/next -> origin/gh/huydhn/4/next 2025-12-04T09:43:32.3960009Z * [new branch] gh/huydhn/5/next -> origin/gh/huydhn/5/next 2025-12-04T09:43:32.3962407Z * [new branch] gh/huydhn/6/next -> origin/gh/huydhn/6/next 2025-12-04T09:43:32.3965171Z * [new branch] gh/int3/97/base -> origin/gh/int3/97/base 2025-12-04T09:43:32.3966953Z * [new branch] gh/int3/97/head -> origin/gh/int3/97/head 
2025-12-04T09:43:32.3970024Z * [new branch] gh/isuruf/101/base -> origin/gh/isuruf/101/base 2025-12-04T09:43:32.3971631Z * [new branch] gh/isuruf/101/head -> origin/gh/isuruf/101/head 2025-12-04T09:43:32.3973893Z * [new branch] gh/isuruf/146/base -> origin/gh/isuruf/146/base 2025-12-04T09:43:32.3975621Z * [new branch] gh/isuruf/146/head -> origin/gh/isuruf/146/head 2025-12-04T09:43:32.3977324Z * [new branch] gh/isuruf/146/orig -> origin/gh/isuruf/146/orig 2025-12-04T09:43:32.3979688Z * [new branch] gh/isuruf/158/base -> origin/gh/isuruf/158/base 2025-12-04T09:43:32.3981270Z * [new branch] gh/isuruf/158/head -> origin/gh/isuruf/158/head 2025-12-04T09:43:32.3983489Z * [new branch] gh/isuruf/159/base -> origin/gh/isuruf/159/base 2025-12-04T09:43:32.3985231Z * [new branch] gh/isuruf/159/head -> origin/gh/isuruf/159/head 2025-12-04T09:43:32.3987670Z * [new branch] gh/isuruf/160/base -> origin/gh/isuruf/160/base 2025-12-04T09:43:32.3989891Z * [new branch] gh/isuruf/160/head -> origin/gh/isuruf/160/head 2025-12-04T09:43:32.3991691Z * [new branch] gh/isuruf/160/orig -> origin/gh/isuruf/160/orig 2025-12-04T09:43:32.3994028Z * [new branch] gh/isuruf/81/base -> origin/gh/isuruf/81/base 2025-12-04T09:43:32.3995693Z * [new branch] gh/isuruf/81/head -> origin/gh/isuruf/81/head 2025-12-04T09:43:32.3997415Z * [new branch] gh/isuruf/81/orig -> origin/gh/isuruf/81/orig 2025-12-04T09:43:32.4000199Z * [new branch] gh/jamesjwu/176/base -> origin/gh/jamesjwu/176/base 2025-12-04T09:43:32.4001899Z * [new branch] gh/jamesjwu/176/head -> origin/gh/jamesjwu/176/head 2025-12-04T09:43:32.4003625Z * [new branch] gh/jamesjwu/176/orig -> origin/gh/jamesjwu/176/orig 2025-12-04T09:43:32.4005927Z * [new branch] gh/jamesjwu/187/base -> origin/gh/jamesjwu/187/base 2025-12-04T09:43:32.4007639Z * [new branch] gh/jamesjwu/187/head -> origin/gh/jamesjwu/187/head 2025-12-04T09:43:32.4009366Z * [new branch] gh/jamesjwu/187/orig -> origin/gh/jamesjwu/187/orig 2025-12-04T09:43:32.4011727Z * [new branch] 
gh/jamesjwu/196/base -> origin/gh/jamesjwu/196/base 2025-12-04T09:43:32.4013425Z * [new branch] gh/jamesjwu/196/head -> origin/gh/jamesjwu/196/head 2025-12-04T09:43:32.4015205Z * [new branch] gh/jamesjwu/196/orig -> origin/gh/jamesjwu/196/orig 2025-12-04T09:43:32.4017481Z * [new branch] gh/jamesjwu/198/base -> origin/gh/jamesjwu/198/base 2025-12-04T09:43:32.4019183Z * [new branch] gh/jamesjwu/198/head -> origin/gh/jamesjwu/198/head 2025-12-04T09:43:32.4020919Z * [new branch] gh/jamesjwu/198/orig -> origin/gh/jamesjwu/198/orig 2025-12-04T09:43:32.4023281Z * [new branch] gh/jamesjwu/207/base -> origin/gh/jamesjwu/207/base 2025-12-04T09:43:32.4025255Z * [new branch] gh/jamesjwu/207/head -> origin/gh/jamesjwu/207/head 2025-12-04T09:43:32.4027109Z * [new branch] gh/jamesjwu/207/orig -> origin/gh/jamesjwu/207/orig 2025-12-04T09:43:32.4029724Z * [new branch] gh/jamesjwu/208/base -> origin/gh/jamesjwu/208/base 2025-12-04T09:43:32.4031454Z * [new branch] gh/jamesjwu/208/head -> origin/gh/jamesjwu/208/head 2025-12-04T09:43:32.4033186Z * [new branch] gh/jamesjwu/208/orig -> origin/gh/jamesjwu/208/orig 2025-12-04T09:43:32.4035629Z * [new branch] gh/jamesjwu/52/base -> origin/gh/jamesjwu/52/base 2025-12-04T09:43:32.4037351Z * [new branch] gh/jamesjwu/52/head -> origin/gh/jamesjwu/52/head 2025-12-04T09:43:32.4039649Z * [new branch] gh/jamesjwu/53/base -> origin/gh/jamesjwu/53/base 2025-12-04T09:43:32.4041225Z * [new branch] gh/jamesjwu/53/head -> origin/gh/jamesjwu/53/head 2025-12-04T09:43:32.4043379Z * [new branch] gh/jamesjwu/54/base -> origin/gh/jamesjwu/54/base 2025-12-04T09:43:32.4045072Z * [new branch] gh/jamesjwu/54/head -> origin/gh/jamesjwu/54/head 2025-12-04T09:43:32.4047240Z * [new branch] gh/jamesjwu/55/base -> origin/gh/jamesjwu/55/base 2025-12-04T09:43:32.4048947Z * [new branch] gh/jamesjwu/55/head -> origin/gh/jamesjwu/55/head 2025-12-04T09:43:32.4051065Z * [new branch] gh/jamesjwu/56/base -> origin/gh/jamesjwu/56/base 2025-12-04T09:43:32.4052771Z * [new branch] 
gh/jamesjwu/56/head -> origin/gh/jamesjwu/56/head 2025-12-04T09:43:32.4055055Z * [new branch] gh/jamesjwu/57/base -> origin/gh/jamesjwu/57/base 2025-12-04T09:43:32.4056977Z * [new branch] gh/jamesjwu/57/head -> origin/gh/jamesjwu/57/head 2025-12-04T09:43:32.4059099Z * [new branch] gh/jamesjwu/58/base -> origin/gh/jamesjwu/58/base 2025-12-04T09:43:32.4060838Z * [new branch] gh/jamesjwu/58/head -> origin/gh/jamesjwu/58/head 2025-12-04T09:43:32.4062989Z * [new branch] gh/jamesjwu/59/base -> origin/gh/jamesjwu/59/base 2025-12-04T09:43:32.4064720Z * [new branch] gh/jamesjwu/59/head -> origin/gh/jamesjwu/59/head 2025-12-04T09:43:32.4066897Z * [new branch] gh/jamesjwu/60/base -> origin/gh/jamesjwu/60/base 2025-12-04T09:43:32.4068717Z * [new branch] gh/jamesjwu/60/head -> origin/gh/jamesjwu/60/head 2025-12-04T09:43:32.4070862Z * [new branch] gh/jamesjwu/61/base -> origin/gh/jamesjwu/61/base 2025-12-04T09:43:32.4072494Z * [new branch] gh/jamesjwu/61/head -> origin/gh/jamesjwu/61/head 2025-12-04T09:43:32.4074731Z * [new branch] gh/jamesjwu/62/base -> origin/gh/jamesjwu/62/base 2025-12-04T09:43:32.4076441Z * [new branch] gh/jamesjwu/62/head -> origin/gh/jamesjwu/62/head 2025-12-04T09:43:32.4079095Z * [new branch] gh/jamesjwu/63/base -> origin/gh/jamesjwu/63/base 2025-12-04T09:43:32.4080902Z * [new branch] gh/jamesjwu/63/head -> origin/gh/jamesjwu/63/head 2025-12-04T09:43:32.4083732Z * [new branch] gh/jamesjwu/64/base -> origin/gh/jamesjwu/64/base 2025-12-04T09:43:32.4085451Z * [new branch] gh/jamesjwu/64/head -> origin/gh/jamesjwu/64/head 2025-12-04T09:43:32.4087791Z * [new branch] gh/jamesjwu/65/base -> origin/gh/jamesjwu/65/base 2025-12-04T09:43:32.4089436Z * [new branch] gh/jamesjwu/65/head -> origin/gh/jamesjwu/65/head 2025-12-04T09:43:32.4092340Z * [new branch] gh/janeyx99/165/base -> origin/gh/janeyx99/165/base 2025-12-04T09:43:32.4094142Z * [new branch] gh/janeyx99/165/head -> origin/gh/janeyx99/165/head 2025-12-04T09:43:32.4095854Z * [new branch] gh/janeyx99/165/orig 
-> origin/gh/janeyx99/165/orig 2025-12-04T09:43:32.4098166Z * [new branch] gh/janeyx99/201/base -> origin/gh/janeyx99/201/base 2025-12-04T09:43:32.4099830Z * [new branch] gh/janeyx99/201/head -> origin/gh/janeyx99/201/head 2025-12-04T09:43:32.4101550Z * [new branch] gh/janeyx99/201/orig -> origin/gh/janeyx99/201/orig 2025-12-04T09:43:32.4104040Z * [new branch] gh/janeyx99/225/base -> origin/gh/janeyx99/225/base 2025-12-04T09:43:32.4105783Z * [new branch] gh/janeyx99/225/head -> origin/gh/janeyx99/225/head 2025-12-04T09:43:32.4107589Z * [new branch] gh/janeyx99/225/orig -> origin/gh/janeyx99/225/orig 2025-12-04T09:43:32.4109973Z * [new branch] gh/janeyx99/299/base -> origin/gh/janeyx99/299/base 2025-12-04T09:43:32.4111838Z * [new branch] gh/janeyx99/299/head -> origin/gh/janeyx99/299/head 2025-12-04T09:43:32.4113471Z * [new branch] gh/janeyx99/299/orig -> origin/gh/janeyx99/299/orig 2025-12-04T09:43:32.4116010Z * [new branch] gh/janeyx99/302/base -> origin/gh/janeyx99/302/base 2025-12-04T09:43:32.4117752Z * [new branch] gh/janeyx99/302/head -> origin/gh/janeyx99/302/head 2025-12-04T09:43:32.4119977Z * [new branch] gh/janeyx99/303/base -> origin/gh/janeyx99/303/base 2025-12-04T09:43:32.4121689Z * [new branch] gh/janeyx99/303/head -> origin/gh/janeyx99/303/head 2025-12-04T09:43:32.4124029Z * [new branch] gh/janeyx99/305/base -> origin/gh/janeyx99/305/base 2025-12-04T09:43:32.4125739Z * [new branch] gh/janeyx99/305/head -> origin/gh/janeyx99/305/head 2025-12-04T09:43:32.4128071Z * [new branch] gh/janeyx99/306/base -> origin/gh/janeyx99/306/base 2025-12-04T09:43:32.4129696Z * [new branch] gh/janeyx99/306/head -> origin/gh/janeyx99/306/head 2025-12-04T09:43:32.4132431Z * [new branch] gh/janeyx99/314/base -> origin/gh/janeyx99/314/base 2025-12-04T09:43:32.4134230Z * [new branch] gh/janeyx99/314/head -> origin/gh/janeyx99/314/head 2025-12-04T09:43:32.4136009Z * [new branch] gh/janeyx99/314/orig -> origin/gh/janeyx99/314/orig 2025-12-04T09:43:32.4138315Z * [new branch] 
gh/janeyx99/315/base -> origin/gh/janeyx99/315/base 2025-12-04T09:43:32.4140241Z * [new branch] gh/janeyx99/315/head -> origin/gh/janeyx99/315/head 2025-12-04T09:43:32.4142178Z * [new branch] gh/janeyx99/315/orig -> origin/gh/janeyx99/315/orig 2025-12-04T09:43:32.4144546Z * [new branch] gh/janeyx99/316/base -> origin/gh/janeyx99/316/base 2025-12-04T09:43:32.4146314Z * [new branch] gh/janeyx99/316/head -> origin/gh/janeyx99/316/head 2025-12-04T09:43:32.4148089Z * [new branch] gh/janeyx99/316/orig -> origin/gh/janeyx99/316/orig 2025-12-04T09:43:32.4150782Z * [new branch] gh/janeyx99/317/base -> origin/gh/janeyx99/317/base 2025-12-04T09:43:32.4152432Z * [new branch] gh/janeyx99/317/head -> origin/gh/janeyx99/317/head 2025-12-04T09:43:32.4154121Z * [new branch] gh/janeyx99/317/orig -> origin/gh/janeyx99/317/orig 2025-12-04T09:43:32.4158412Z * [new branch] gh/janeyx99/325/base -> origin/gh/janeyx99/325/base 2025-12-04T09:43:32.4160181Z * [new branch] gh/janeyx99/325/head -> origin/gh/janeyx99/325/head 2025-12-04T09:43:32.4161950Z * [new branch] gh/janeyx99/325/orig -> origin/gh/janeyx99/325/orig 2025-12-04T09:43:32.4164245Z * [new branch] gh/janeyx99/327/base -> origin/gh/janeyx99/327/base 2025-12-04T09:43:32.4165883Z * [new branch] gh/janeyx99/327/head -> origin/gh/janeyx99/327/head 2025-12-04T09:43:32.4167652Z * [new branch] gh/janeyx99/327/orig -> origin/gh/janeyx99/327/orig 2025-12-04T09:43:32.4170042Z * [new branch] gh/janeyx99/328/base -> origin/gh/janeyx99/328/base 2025-12-04T09:43:32.4171855Z * [new branch] gh/janeyx99/328/head -> origin/gh/janeyx99/328/head 2025-12-04T09:43:32.4173601Z * [new branch] gh/janeyx99/328/orig -> origin/gh/janeyx99/328/orig 2025-12-04T09:43:32.4175812Z * [new branch] gh/janeyx99/329/base -> origin/gh/janeyx99/329/base 2025-12-04T09:43:32.4177559Z * [new branch] gh/janeyx99/329/head -> origin/gh/janeyx99/329/head 2025-12-04T09:43:32.4179275Z * [new branch] gh/janeyx99/329/orig -> origin/gh/janeyx99/329/orig 
2025-12-04T09:43:32.4182313Z * [new branch] gh/janeyx99/330/base -> origin/gh/janeyx99/330/base 2025-12-04T09:43:32.4184092Z * [new branch] gh/janeyx99/330/head -> origin/gh/janeyx99/330/head 2025-12-04T09:43:32.4185670Z * [new branch] gh/janeyx99/330/orig -> origin/gh/janeyx99/330/orig 2025-12-04T09:43:32.4188199Z * [new branch] gh/janeyx99/331/base -> origin/gh/janeyx99/331/base 2025-12-04T09:43:32.4189948Z * [new branch] gh/janeyx99/331/head -> origin/gh/janeyx99/331/head 2025-12-04T09:43:32.4192159Z * [new branch] gh/janeyx99/331/orig -> origin/gh/janeyx99/331/orig 2025-12-04T09:43:32.4194737Z * [new branch] gh/janeyx99/332/base -> origin/gh/janeyx99/332/base 2025-12-04T09:43:32.4196358Z * [new branch] gh/janeyx99/332/head -> origin/gh/janeyx99/332/head 2025-12-04T09:43:32.4198056Z * [new branch] gh/janeyx99/332/orig -> origin/gh/janeyx99/332/orig 2025-12-04T09:43:32.4200301Z * [new branch] gh/janeyx99/333/base -> origin/gh/janeyx99/333/base 2025-12-04T09:43:32.4201989Z * [new branch] gh/janeyx99/333/head -> origin/gh/janeyx99/333/head 2025-12-04T09:43:32.4203802Z * [new branch] gh/janeyx99/333/orig -> origin/gh/janeyx99/333/orig 2025-12-04T09:43:32.4206265Z * [new branch] gh/janeyx99/88/base -> origin/gh/janeyx99/88/base 2025-12-04T09:43:32.4207970Z * [new branch] gh/janeyx99/88/head -> origin/gh/janeyx99/88/head 2025-12-04T09:43:32.4209694Z * [new branch] gh/janeyx99/88/orig -> origin/gh/janeyx99/88/orig 2025-12-04T09:43:32.4212589Z * [new branch] gh/jansel/360/base -> origin/gh/jansel/360/base 2025-12-04T09:43:32.4214379Z * [new branch] gh/jansel/360/head -> origin/gh/jansel/360/head 2025-12-04T09:43:32.4216760Z * [new branch] gh/jansel/451/base -> origin/gh/jansel/451/base 2025-12-04T09:43:32.4218543Z * [new branch] gh/jansel/451/head -> origin/gh/jansel/451/head 2025-12-04T09:43:32.4220312Z * [new branch] gh/jansel/451/orig -> origin/gh/jansel/451/orig 2025-12-04T09:43:32.4222591Z * [new branch] gh/jansel/462/base -> origin/gh/jansel/462/base 
2025-12-04T09:43:32.4224333Z * [new branch] gh/jansel/462/head -> origin/gh/jansel/462/head 2025-12-04T09:43:32.4226009Z * [new branch] gh/jansel/462/orig -> origin/gh/jansel/462/orig 2025-12-04T09:43:32.4228478Z * [new branch] gh/jansel/533/base -> origin/gh/jansel/533/base 2025-12-04T09:43:32.4230238Z * [new branch] gh/jansel/533/head -> origin/gh/jansel/533/head 2025-12-04T09:43:32.4231976Z * [new branch] gh/jansel/533/orig -> origin/gh/jansel/533/orig 2025-12-04T09:43:32.4234397Z * [new branch] gh/jansel/552/base -> origin/gh/jansel/552/base 2025-12-04T09:43:32.4236096Z * [new branch] gh/jansel/552/head -> origin/gh/jansel/552/head 2025-12-04T09:43:32.4238382Z * [new branch] gh/jansel/552/orig -> origin/gh/jansel/552/orig 2025-12-04T09:43:32.4240681Z * [new branch] gh/jansel/553/base -> origin/gh/jansel/553/base 2025-12-04T09:43:32.4242354Z * [new branch] gh/jansel/553/head -> origin/gh/jansel/553/head 2025-12-04T09:43:32.4244056Z * [new branch] gh/jansel/553/orig -> origin/gh/jansel/553/orig 2025-12-04T09:43:32.4246829Z * [new branch] gh/jansel/554/base -> origin/gh/jansel/554/base 2025-12-04T09:43:32.4248563Z * [new branch] gh/jansel/554/head -> origin/gh/jansel/554/head 2025-12-04T09:43:32.4250332Z * [new branch] gh/jansel/554/orig -> origin/gh/jansel/554/orig 2025-12-04T09:43:32.4252669Z * [new branch] gh/jansel/555/base -> origin/gh/jansel/555/base 2025-12-04T09:43:32.4254444Z * [new branch] gh/jansel/555/head -> origin/gh/jansel/555/head 2025-12-04T09:43:32.4256257Z * [new branch] gh/jansel/555/orig -> origin/gh/jansel/555/orig 2025-12-04T09:43:32.4258820Z * [new branch] gh/jansel/556/base -> origin/gh/jansel/556/base 2025-12-04T09:43:32.4260326Z * [new branch] gh/jansel/556/head -> origin/gh/jansel/556/head 2025-12-04T09:43:32.4262042Z * [new branch] gh/jansel/556/orig -> origin/gh/jansel/556/orig 2025-12-04T09:43:32.4264448Z * [new branch] gh/jansel/557/base -> origin/gh/jansel/557/base 2025-12-04T09:43:32.4266136Z * [new branch] gh/jansel/557/head -> 
origin/gh/jansel/557/head 2025-12-04T09:43:32.4267934Z * [new branch] gh/jansel/557/orig -> origin/gh/jansel/557/orig 2025-12-04T09:43:32.4270145Z * [new branch] gh/jansel/558/base -> origin/gh/jansel/558/base 2025-12-04T09:43:32.4271870Z * [new branch] gh/jansel/558/head -> origin/gh/jansel/558/head 2025-12-04T09:43:32.4273560Z * [new branch] gh/jansel/558/orig -> origin/gh/jansel/558/orig 2025-12-04T09:43:32.4275931Z * [new branch] gh/jansel/559/base -> origin/gh/jansel/559/base 2025-12-04T09:43:32.4277628Z * [new branch] gh/jansel/559/head -> origin/gh/jansel/559/head 2025-12-04T09:43:32.4279345Z * [new branch] gh/jansel/559/orig -> origin/gh/jansel/559/orig 2025-12-04T09:43:32.4281669Z * [new branch] gh/jansel/560/base -> origin/gh/jansel/560/base 2025-12-04T09:43:32.4283325Z * [new branch] gh/jansel/560/head -> origin/gh/jansel/560/head 2025-12-04T09:43:32.4285011Z * [new branch] gh/jansel/560/orig -> origin/gh/jansel/560/orig 2025-12-04T09:43:32.4287471Z * [new branch] gh/jansel/561/base -> origin/gh/jansel/561/base 2025-12-04T09:43:32.4289152Z * [new branch] gh/jansel/561/head -> origin/gh/jansel/561/head 2025-12-04T09:43:32.4290850Z * [new branch] gh/jansel/561/orig -> origin/gh/jansel/561/orig 2025-12-04T09:43:32.4293203Z * [new branch] gh/jansel/562/base -> origin/gh/jansel/562/base 2025-12-04T09:43:32.4294918Z * [new branch] gh/jansel/562/head -> origin/gh/jansel/562/head 2025-12-04T09:43:32.4296587Z * [new branch] gh/jansel/562/orig -> origin/gh/jansel/562/orig 2025-12-04T09:43:32.4298952Z * [new branch] gh/jansel/563/base -> origin/gh/jansel/563/base 2025-12-04T09:43:32.4300660Z * [new branch] gh/jansel/563/head -> origin/gh/jansel/563/head 2025-12-04T09:43:32.4302404Z * [new branch] gh/jansel/563/orig -> origin/gh/jansel/563/orig 2025-12-04T09:43:32.4305269Z * [new branch] gh/jansel/564/base -> origin/gh/jansel/564/base 2025-12-04T09:43:32.4307004Z * [new branch] gh/jansel/564/head -> origin/gh/jansel/564/head 2025-12-04T09:43:32.4308891Z * [new 
branch] gh/jansel/564/orig -> origin/gh/jansel/564/orig 2025-12-04T09:43:32.4311893Z * [new branch] gh/jansel/565/base -> origin/gh/jansel/565/base 2025-12-04T09:43:32.4313554Z * [new branch] gh/jansel/565/head -> origin/gh/jansel/565/head 2025-12-04T09:43:32.4315276Z * [new branch] gh/jansel/565/orig -> origin/gh/jansel/565/orig 2025-12-04T09:43:32.4318143Z * [new branch] gh/jansel/566/base -> origin/gh/jansel/566/base 2025-12-04T09:43:32.4319937Z * [new branch] gh/jansel/566/head -> origin/gh/jansel/566/head 2025-12-04T09:43:32.4321613Z * [new branch] gh/jansel/566/orig -> origin/gh/jansel/566/orig 2025-12-04T09:43:32.4324012Z * [new branch] gh/jansel/567/base -> origin/gh/jansel/567/base 2025-12-04T09:43:32.4325825Z * [new branch] gh/jansel/567/head -> origin/gh/jansel/567/head 2025-12-04T09:43:32.4327537Z * [new branch] gh/jansel/567/orig -> origin/gh/jansel/567/orig 2025-12-04T09:43:32.4329891Z * [new branch] gh/jansel/568/base -> origin/gh/jansel/568/base 2025-12-04T09:43:32.4331601Z * [new branch] gh/jansel/568/head -> origin/gh/jansel/568/head 2025-12-04T09:43:32.4333330Z * [new branch] gh/jansel/568/orig -> origin/gh/jansel/568/orig 2025-12-04T09:43:32.4335760Z * [new branch] gh/jansel/569/base -> origin/gh/jansel/569/base 2025-12-04T09:43:32.4337535Z * [new branch] gh/jansel/569/head -> origin/gh/jansel/569/head 2025-12-04T09:43:32.4339243Z * [new branch] gh/jansel/569/orig -> origin/gh/jansel/569/orig 2025-12-04T09:43:32.4341645Z * [new branch] gh/jansel/570/base -> origin/gh/jansel/570/base 2025-12-04T09:43:32.4343335Z * [new branch] gh/jansel/570/head -> origin/gh/jansel/570/head 2025-12-04T09:43:32.4345072Z * [new branch] gh/jansel/570/orig -> origin/gh/jansel/570/orig 2025-12-04T09:43:32.4347497Z * [new branch] gh/jansel/571/base -> origin/gh/jansel/571/base 2025-12-04T09:43:32.4349256Z * [new branch] gh/jansel/571/head -> origin/gh/jansel/571/head 2025-12-04T09:43:32.4350954Z * [new branch] gh/jansel/571/orig -> origin/gh/jansel/571/orig 
2025-12-04T09:43:32.4353707Z * [new branch] gh/jansel/572/base -> origin/gh/jansel/572/base 2025-12-04T09:43:32.4355596Z * [new branch] gh/jansel/572/head -> origin/gh/jansel/572/head 2025-12-04T09:43:32.4357300Z * [new branch] gh/jansel/572/orig -> origin/gh/jansel/572/orig 2025-12-04T09:43:32.4359778Z * [new branch] gh/jansel/573/base -> origin/gh/jansel/573/base 2025-12-04T09:43:32.4361490Z * [new branch] gh/jansel/573/head -> origin/gh/jansel/573/head 2025-12-04T09:43:32.4363368Z * [new branch] gh/jansel/573/orig -> origin/gh/jansel/573/orig 2025-12-04T09:43:32.4365911Z * [new branch] gh/jansel/574/base -> origin/gh/jansel/574/base 2025-12-04T09:43:32.4367958Z * [new branch] gh/jansel/574/head -> origin/gh/jansel/574/head 2025-12-04T09:43:32.4369857Z * [new branch] gh/jansel/574/orig -> origin/gh/jansel/574/orig 2025-12-04T09:43:32.4372054Z * [new branch] gh/jansel/575/base -> origin/gh/jansel/575/base 2025-12-04T09:43:32.4373845Z * [new branch] gh/jansel/575/head -> origin/gh/jansel/575/head 2025-12-04T09:43:32.4375926Z * [new branch] gh/jansel/575/orig -> origin/gh/jansel/575/orig 2025-12-04T09:43:32.4378305Z * [new branch] gh/jansel/576/base -> origin/gh/jansel/576/base 2025-12-04T09:43:32.4380779Z * [new branch] gh/jansel/576/head -> origin/gh/jansel/576/head 2025-12-04T09:43:32.4382559Z * [new branch] gh/jansel/576/orig -> origin/gh/jansel/576/orig 2025-12-04T09:43:32.4385473Z * [new branch] gh/jbschlosser/247/base -> origin/gh/jbschlosser/247/base 2025-12-04T09:43:32.4387184Z * [new branch] gh/jbschlosser/247/head -> origin/gh/jbschlosser/247/head 2025-12-04T09:43:32.4389052Z * [new branch] gh/jbschlosser/247/orig -> origin/gh/jbschlosser/247/orig 2025-12-04T09:43:32.4391478Z * [new branch] gh/jbschlosser/250/base -> origin/gh/jbschlosser/250/base 2025-12-04T09:43:32.4393277Z * [new branch] gh/jbschlosser/250/head -> origin/gh/jbschlosser/250/head 2025-12-04T09:43:32.4394981Z * [new branch] gh/jbschlosser/250/orig -> origin/gh/jbschlosser/250/orig 
2025-12-04T09:43:32.4398028Z * [new branch] gh/jerryzh168/1/base -> origin/gh/jerryzh168/1/base 2025-12-04T09:43:32.4399591Z * [new branch] gh/jerryzh168/1/head -> origin/gh/jerryzh168/1/head 2025-12-04T09:43:32.4401313Z * [new branch] gh/jerryzh168/1/orig -> origin/gh/jerryzh168/1/orig 2025-12-04T09:43:32.4404069Z * [new branch] gh/jiayisunx/59/base -> origin/gh/jiayisunx/59/base 2025-12-04T09:43:32.4405865Z * [new branch] gh/jiayisunx/59/head -> origin/gh/jiayisunx/59/head 2025-12-04T09:43:32.4407618Z * [new branch] gh/jiayisunx/59/orig -> origin/gh/jiayisunx/59/orig 2025-12-04T09:43:32.4409828Z * [new branch] gh/jiayisunx/61/base -> origin/gh/jiayisunx/61/base 2025-12-04T09:43:32.4411558Z * [new branch] gh/jiayisunx/61/head -> origin/gh/jiayisunx/61/head 2025-12-04T09:43:32.4413301Z * [new branch] gh/jiayisunx/61/orig -> origin/gh/jiayisunx/61/orig 2025-12-04T09:43:32.4415672Z * [new branch] gh/jiayisunx/68/base -> origin/gh/jiayisunx/68/base 2025-12-04T09:43:32.4417300Z * [new branch] gh/jiayisunx/68/head -> origin/gh/jiayisunx/68/head 2025-12-04T09:43:32.4418994Z * [new branch] gh/jiayisunx/68/orig -> origin/gh/jiayisunx/68/orig 2025-12-04T09:43:32.4421346Z * [new branch] gh/jiayisunx/77/base -> origin/gh/jiayisunx/77/base 2025-12-04T09:43:32.4423119Z * [new branch] gh/jiayisunx/77/head -> origin/gh/jiayisunx/77/head 2025-12-04T09:43:32.4424818Z * [new branch] gh/jiayisunx/77/orig -> origin/gh/jiayisunx/77/orig 2025-12-04T09:43:32.4427151Z * [new branch] gh/jiayisunx/78/base -> origin/gh/jiayisunx/78/base 2025-12-04T09:43:32.4429059Z * [new branch] gh/jiayisunx/78/head -> origin/gh/jiayisunx/78/head 2025-12-04T09:43:32.4430814Z * [new branch] gh/jiayisunx/78/orig -> origin/gh/jiayisunx/78/orig 2025-12-04T09:43:32.4433072Z * [new branch] gh/jiayisunx/79/base -> origin/gh/jiayisunx/79/base 2025-12-04T09:43:32.4434789Z * [new branch] gh/jiayisunx/79/head -> origin/gh/jiayisunx/79/head 2025-12-04T09:43:32.4436482Z * [new branch] gh/jiayisunx/79/orig -> 
origin/gh/jiayisunx/79/orig 2025-12-04T09:43:32.4438954Z * [new branch] gh/jiayisunx/82/base -> origin/gh/jiayisunx/82/base 2025-12-04T09:43:32.4440721Z * [new branch] gh/jiayisunx/82/head -> origin/gh/jiayisunx/82/head 2025-12-04T09:43:32.4442450Z * [new branch] gh/jiayisunx/82/orig -> origin/gh/jiayisunx/82/orig 2025-12-04T09:43:32.4444711Z * [new branch] gh/jiayisunx/83/base -> origin/gh/jiayisunx/83/base 2025-12-04T09:43:32.4446468Z * [new branch] gh/jiayisunx/83/head -> origin/gh/jiayisunx/83/head 2025-12-04T09:43:32.4448075Z * [new branch] gh/jiayisunx/83/orig -> origin/gh/jiayisunx/83/orig 2025-12-04T09:43:32.4450285Z * [new branch] gh/jiayisunx/84/base -> origin/gh/jiayisunx/84/base 2025-12-04T09:43:32.4452048Z * [new branch] gh/jiayisunx/84/head -> origin/gh/jiayisunx/84/head 2025-12-04T09:43:32.4453779Z * [new branch] gh/jiayisunx/84/orig -> origin/gh/jiayisunx/84/orig 2025-12-04T09:43:32.4456390Z * [new branch] gh/jiayisunx/85/base -> origin/gh/jiayisunx/85/base 2025-12-04T09:43:32.4458169Z * [new branch] gh/jiayisunx/85/head -> origin/gh/jiayisunx/85/head 2025-12-04T09:43:32.4459869Z * [new branch] gh/jiayisunx/85/orig -> origin/gh/jiayisunx/85/orig 2025-12-04T09:43:32.4462612Z * [new branch] gh/jiayisunx/86/base -> origin/gh/jiayisunx/86/base 2025-12-04T09:43:32.4464335Z * [new branch] gh/jiayisunx/86/head -> origin/gh/jiayisunx/86/head 2025-12-04T09:43:32.4466279Z * [new branch] gh/jiayisunx/86/orig -> origin/gh/jiayisunx/86/orig 2025-12-04T09:43:32.4468584Z * [new branch] gh/jiayisunx/87/base -> origin/gh/jiayisunx/87/base 2025-12-04T09:43:32.4470286Z * [new branch] gh/jiayisunx/87/head -> origin/gh/jiayisunx/87/head 2025-12-04T09:43:32.4471994Z * [new branch] gh/jiayisunx/87/orig -> origin/gh/jiayisunx/87/orig 2025-12-04T09:43:32.4474276Z * [new branch] gh/jiayisunx/88/base -> origin/gh/jiayisunx/88/base 2025-12-04T09:43:32.4476093Z * [new branch] gh/jiayisunx/88/head -> origin/gh/jiayisunx/88/head 2025-12-04T09:43:32.4477784Z * [new branch] 
gh/jiayisunx/88/orig -> origin/gh/jiayisunx/88/orig 2025-12-04T09:43:32.4480098Z * [new branch] gh/jiayisunx/89/base -> origin/gh/jiayisunx/89/base 2025-12-04T09:43:32.4481771Z * [new branch] gh/jiayisunx/89/head -> origin/gh/jiayisunx/89/head 2025-12-04T09:43:32.4483450Z * [new branch] gh/jiayisunx/89/orig -> origin/gh/jiayisunx/89/orig 2025-12-04T09:43:32.4485732Z * [new branch] gh/jiayisunx/90/base -> origin/gh/jiayisunx/90/base 2025-12-04T09:43:32.4487421Z * [new branch] gh/jiayisunx/90/head -> origin/gh/jiayisunx/90/head 2025-12-04T09:43:32.4489137Z * [new branch] gh/jiayisunx/90/orig -> origin/gh/jiayisunx/90/orig 2025-12-04T09:43:32.4491785Z * [new branch] gh/jjwu@meta.com/1/base -> origin/gh/jjwu@meta.com/1/base 2025-12-04T09:43:32.4493472Z * [new branch] gh/jjwu@meta.com/1/head -> origin/gh/jjwu@meta.com/1/head 2025-12-04T09:43:32.4496756Z * [new branch] gh/jturney/1/base -> origin/gh/jturney/1/base 2025-12-04T09:43:32.4498484Z * [new branch] gh/jturney/1/head -> origin/gh/jturney/1/head 2025-12-04T09:43:32.4500215Z * [new branch] gh/jturney/1/orig -> origin/gh/jturney/1/orig 2025-12-04T09:43:32.4502545Z * [new branch] gh/jturney/2/base -> origin/gh/jturney/2/base 2025-12-04T09:43:32.4504217Z * [new branch] gh/jturney/2/head -> origin/gh/jturney/2/head 2025-12-04T09:43:32.4505890Z * [new branch] gh/jturney/2/orig -> origin/gh/jturney/2/orig 2025-12-04T09:43:32.4508896Z * [new branch] gh/karthickai/10/base -> origin/gh/karthickai/10/base 2025-12-04T09:43:32.4510807Z * [new branch] gh/karthickai/10/head -> origin/gh/karthickai/10/head 2025-12-04T09:43:32.4512475Z * [new branch] gh/karthickai/10/orig -> origin/gh/karthickai/10/orig 2025-12-04T09:43:32.4514897Z * [new branch] gh/karthickai/11/base -> origin/gh/karthickai/11/base 2025-12-04T09:43:32.4516632Z * [new branch] gh/karthickai/11/head -> origin/gh/karthickai/11/head 2025-12-04T09:43:32.4518484Z * [new branch] gh/karthickai/11/orig -> origin/gh/karthickai/11/orig 2025-12-04T09:43:32.4521112Z * [new 
branch] gh/karthickai/12/base -> origin/gh/karthickai/12/base 2025-12-04T09:43:32.4522895Z * [new branch] gh/karthickai/12/head -> origin/gh/karthickai/12/head 2025-12-04T09:43:32.4524800Z * [new branch] gh/karthickai/12/orig -> origin/gh/karthickai/12/orig 2025-12-04T09:43:32.4527135Z * [new branch] gh/karthickai/13/base -> origin/gh/karthickai/13/base 2025-12-04T09:43:32.4528908Z * [new branch] gh/karthickai/13/head -> origin/gh/karthickai/13/head 2025-12-04T09:43:32.4530577Z * [new branch] gh/karthickai/13/orig -> origin/gh/karthickai/13/orig 2025-12-04T09:43:32.4533003Z * [new branch] gh/karthickai/14/base -> origin/gh/karthickai/14/base 2025-12-04T09:43:32.4534801Z * [new branch] gh/karthickai/14/head -> origin/gh/karthickai/14/head 2025-12-04T09:43:32.4536573Z * [new branch] gh/karthickai/14/orig -> origin/gh/karthickai/14/orig 2025-12-04T09:43:32.4539062Z * [new branch] gh/karthickai/15/base -> origin/gh/karthickai/15/base 2025-12-04T09:43:32.4541234Z * [new branch] gh/karthickai/15/head -> origin/gh/karthickai/15/head 2025-12-04T09:43:32.4542921Z * [new branch] gh/karthickai/15/orig -> origin/gh/karthickai/15/orig 2025-12-04T09:43:32.4545244Z * [new branch] gh/karthickai/16/base -> origin/gh/karthickai/16/base 2025-12-04T09:43:32.4546999Z * [new branch] gh/karthickai/16/head -> origin/gh/karthickai/16/head 2025-12-04T09:43:32.4548894Z * [new branch] gh/karthickai/16/orig -> origin/gh/karthickai/16/orig 2025-12-04T09:43:32.4551084Z * [new branch] gh/karthickai/17/base -> origin/gh/karthickai/17/base 2025-12-04T09:43:32.4552732Z * [new branch] gh/karthickai/17/head -> origin/gh/karthickai/17/head 2025-12-04T09:43:32.4554407Z * [new branch] gh/karthickai/17/orig -> origin/gh/karthickai/17/orig 2025-12-04T09:43:32.4558523Z * [new branch] gh/karthickai/18/base -> origin/gh/karthickai/18/base 2025-12-04T09:43:32.4560405Z * [new branch] gh/karthickai/18/head -> origin/gh/karthickai/18/head 2025-12-04T09:43:32.4562154Z * [new branch] gh/karthickai/18/orig -> 
origin/gh/karthickai/18/orig 2025-12-04T09:43:32.4564538Z * [new branch] gh/karthickai/19/base -> origin/gh/karthickai/19/base 2025-12-04T09:43:32.4566267Z * [new branch] gh/karthickai/19/head -> origin/gh/karthickai/19/head 2025-12-04T09:43:32.4567977Z * [new branch] gh/karthickai/19/orig -> origin/gh/karthickai/19/orig 2025-12-04T09:43:32.4570931Z * [new branch] gh/karthickai/20/base -> origin/gh/karthickai/20/base 2025-12-04T09:43:32.4573011Z * [new branch] gh/karthickai/20/head -> origin/gh/karthickai/20/head 2025-12-04T09:43:32.4574793Z * [new branch] gh/karthickai/20/orig -> origin/gh/karthickai/20/orig 2025-12-04T09:43:32.4577240Z * [new branch] gh/karthickai/21/base -> origin/gh/karthickai/21/base 2025-12-04T09:43:32.4579049Z * [new branch] gh/karthickai/21/head -> origin/gh/karthickai/21/head 2025-12-04T09:43:32.4580844Z * [new branch] gh/karthickai/21/orig -> origin/gh/karthickai/21/orig 2025-12-04T09:43:32.4583293Z * [new branch] gh/karthickai/22/base -> origin/gh/karthickai/22/base 2025-12-04T09:43:32.4585004Z * [new branch] gh/karthickai/22/head -> origin/gh/karthickai/22/head 2025-12-04T09:43:32.4586738Z * [new branch] gh/karthickai/22/orig -> origin/gh/karthickai/22/orig 2025-12-04T09:43:32.4589275Z * [new branch] gh/karthickai/23/base -> origin/gh/karthickai/23/base 2025-12-04T09:43:32.4591137Z * [new branch] gh/karthickai/23/head -> origin/gh/karthickai/23/head 2025-12-04T09:43:32.4592825Z * [new branch] gh/karthickai/23/orig -> origin/gh/karthickai/23/orig 2025-12-04T09:43:32.4595302Z * [new branch] gh/karthickai/24/base -> origin/gh/karthickai/24/base 2025-12-04T09:43:32.4597028Z * [new branch] gh/karthickai/24/head -> origin/gh/karthickai/24/head 2025-12-04T09:43:32.4599192Z * [new branch] gh/karthickai/24/orig -> origin/gh/karthickai/24/orig 2025-12-04T09:43:32.4601921Z * [new branch] gh/karthickai/25/base -> origin/gh/karthickai/25/base 2025-12-04T09:43:32.4603703Z * [new branch] gh/karthickai/25/head -> origin/gh/karthickai/25/head 
2025-12-04T09:43:32.4605389Z * [new branch] gh/karthickai/25/orig -> origin/gh/karthickai/25/orig 2025-12-04T09:43:32.4607652Z * [new branch] gh/karthickai/26/base -> origin/gh/karthickai/26/base 2025-12-04T09:43:32.4609742Z * [new branch] gh/karthickai/26/head -> origin/gh/karthickai/26/head 2025-12-04T09:43:32.4611240Z * [new branch] gh/karthickai/26/orig -> origin/gh/karthickai/26/orig 2025-12-04T09:43:32.4614809Z * [new branch] gh/karthickai/6/base -> origin/gh/karthickai/6/base 2025-12-04T09:43:32.4616914Z * [new branch] gh/karthickai/6/head -> origin/gh/karthickai/6/head 2025-12-04T09:43:32.4618691Z * [new branch] gh/karthickai/6/orig -> origin/gh/karthickai/6/orig 2025-12-04T09:43:32.4621570Z * [new branch] gh/krocki/1/base -> origin/gh/krocki/1/base 2025-12-04T09:43:32.4623295Z * [new branch] gh/krocki/1/head -> origin/gh/krocki/1/head 2025-12-04T09:43:32.4624984Z * [new branch] gh/krocki/1/orig -> origin/gh/krocki/1/orig 2025-12-04T09:43:32.4627435Z * [new branch] gh/krocki/2/base -> origin/gh/krocki/2/base 2025-12-04T09:43:32.4629234Z * [new branch] gh/krocki/2/head -> origin/gh/krocki/2/head 2025-12-04T09:43:32.4631470Z * [new branch] gh/krocki/2/orig -> origin/gh/krocki/2/orig 2025-12-04T09:43:32.4634249Z * [new branch] gh/kurtamohler/60/base -> origin/gh/kurtamohler/60/base 2025-12-04T09:43:32.4635941Z * [new branch] gh/kurtamohler/60/head -> origin/gh/kurtamohler/60/head 2025-12-04T09:43:32.4637622Z * [new branch] gh/kurtamohler/60/orig -> origin/gh/kurtamohler/60/orig 2025-12-04T09:43:32.4639996Z * [new branch] gh/kurtamohler/61/base -> origin/gh/kurtamohler/61/base 2025-12-04T09:43:32.4641773Z * [new branch] gh/kurtamohler/61/head -> origin/gh/kurtamohler/61/head 2025-12-04T09:43:32.4643462Z * [new branch] gh/kurtamohler/61/orig -> origin/gh/kurtamohler/61/orig 2025-12-04T09:43:32.4645818Z * [new branch] gh/kurtamohler/62/base -> origin/gh/kurtamohler/62/base 2025-12-04T09:43:32.4647587Z * [new branch] gh/kurtamohler/62/head -> 
origin/gh/kurtamohler/62/head 2025-12-04T09:43:32.4649245Z * [new branch] gh/kurtamohler/62/orig -> origin/gh/kurtamohler/62/orig 2025-12-04T09:43:32.4651712Z * [new branch] gh/kurtamohler/63/base -> origin/gh/kurtamohler/63/base 2025-12-04T09:43:32.4653421Z * [new branch] gh/kurtamohler/63/head -> origin/gh/kurtamohler/63/head 2025-12-04T09:43:32.4655130Z * [new branch] gh/kurtamohler/63/orig -> origin/gh/kurtamohler/63/orig 2025-12-04T09:43:32.4657766Z * [new branch] gh/kurtamohler/64/base -> origin/gh/kurtamohler/64/base 2025-12-04T09:43:32.4659453Z * [new branch] gh/kurtamohler/64/head -> origin/gh/kurtamohler/64/head 2025-12-04T09:43:32.4661484Z * [new branch] gh/kurtamohler/64/orig -> origin/gh/kurtamohler/64/orig 2025-12-04T09:43:32.4663928Z * [new branch] gh/kurtamohler/65/base -> origin/gh/kurtamohler/65/base 2025-12-04T09:43:32.4665719Z * [new branch] gh/kurtamohler/65/head -> origin/gh/kurtamohler/65/head 2025-12-04T09:43:32.4667487Z * [new branch] gh/kurtamohler/65/orig -> origin/gh/kurtamohler/65/orig 2025-12-04T09:43:32.4669851Z * [new branch] gh/kurtamohler/66/base -> origin/gh/kurtamohler/66/base 2025-12-04T09:43:32.4671587Z * [new branch] gh/kurtamohler/66/head -> origin/gh/kurtamohler/66/head 2025-12-04T09:43:32.4673306Z * [new branch] gh/kurtamohler/66/orig -> origin/gh/kurtamohler/66/orig 2025-12-04T09:43:32.4675597Z * [new branch] gh/kurtamohler/67/base -> origin/gh/kurtamohler/67/base 2025-12-04T09:43:32.4677276Z * [new branch] gh/kurtamohler/67/head -> origin/gh/kurtamohler/67/head 2025-12-04T09:43:32.4679128Z * [new branch] gh/kurtamohler/67/orig -> origin/gh/kurtamohler/67/orig 2025-12-04T09:43:32.4681855Z * [new branch] gh/kwen2501/130/base -> origin/gh/kwen2501/130/base 2025-12-04T09:43:32.4683679Z * [new branch] gh/kwen2501/130/head -> origin/gh/kwen2501/130/head 2025-12-04T09:43:32.4685410Z * [new branch] gh/kwen2501/130/orig -> origin/gh/kwen2501/130/orig 2025-12-04T09:43:32.4687912Z * [new branch] gh/kwen2501/170/base -> 
origin/gh/kwen2501/170/base 2025-12-04T09:43:32.4689658Z * [new branch] gh/kwen2501/170/head -> origin/gh/kwen2501/170/head 2025-12-04T09:43:32.4692007Z * [new branch] gh/kwen2501/187/base -> origin/gh/kwen2501/187/base 2025-12-04T09:43:32.4694196Z * [new branch] gh/kwen2501/187/head -> origin/gh/kwen2501/187/head 2025-12-04T09:43:32.4695947Z * [new branch] gh/kwen2501/187/orig -> origin/gh/kwen2501/187/orig 2025-12-04T09:43:32.4698303Z * [new branch] gh/kwen2501/188/base -> origin/gh/kwen2501/188/base 2025-12-04T09:43:32.4700048Z * [new branch] gh/kwen2501/188/head -> origin/gh/kwen2501/188/head 2025-12-04T09:43:32.4701809Z * [new branch] gh/kwen2501/188/orig -> origin/gh/kwen2501/188/orig 2025-12-04T09:43:32.4704660Z * [new branch] gh/kwen2501/211/base -> origin/gh/kwen2501/211/base 2025-12-04T09:43:32.4706376Z * [new branch] gh/kwen2501/211/head -> origin/gh/kwen2501/211/head 2025-12-04T09:43:32.4708786Z * [new branch] gh/kwen2501/224/base -> origin/gh/kwen2501/224/base 2025-12-04T09:43:32.4710741Z * [new branch] gh/kwen2501/224/head -> origin/gh/kwen2501/224/head 2025-12-04T09:43:32.4712583Z * [new branch] gh/kwen2501/224/orig -> origin/gh/kwen2501/224/orig 2025-12-04T09:43:32.4714877Z * [new branch] gh/kwen2501/228/base -> origin/gh/kwen2501/228/base 2025-12-04T09:43:32.4716595Z * [new branch] gh/kwen2501/228/head -> origin/gh/kwen2501/228/head 2025-12-04T09:43:32.4718318Z * [new branch] gh/kwen2501/228/orig -> origin/gh/kwen2501/228/orig 2025-12-04T09:43:32.4720955Z * [new branch] gh/kwen2501/234/base -> origin/gh/kwen2501/234/base 2025-12-04T09:43:32.4722728Z * [new branch] gh/kwen2501/234/head -> origin/gh/kwen2501/234/head 2025-12-04T09:43:32.4724426Z * [new branch] gh/kwen2501/234/orig -> origin/gh/kwen2501/234/orig 2025-12-04T09:43:32.4726774Z * [new branch] gh/kwen2501/235/base -> origin/gh/kwen2501/235/base 2025-12-04T09:43:32.4728577Z * [new branch] gh/kwen2501/235/head -> origin/gh/kwen2501/235/head 2025-12-04T09:43:32.4730363Z * [new branch] 
gh/kwen2501/235/orig -> origin/gh/kwen2501/235/orig 2025-12-04T09:43:32.4733102Z * [new branch] gh/kwen2501/236/base -> origin/gh/kwen2501/236/base 2025-12-04T09:43:32.4734866Z * [new branch] gh/kwen2501/236/head -> origin/gh/kwen2501/236/head 2025-12-04T09:43:32.4736639Z * [new branch] gh/kwen2501/236/orig -> origin/gh/kwen2501/236/orig 2025-12-04T09:43:32.4738892Z * [new branch] gh/kwen2501/237/base -> origin/gh/kwen2501/237/base 2025-12-04T09:43:32.4740606Z * [new branch] gh/kwen2501/237/head -> origin/gh/kwen2501/237/head 2025-12-04T09:43:32.4742331Z * [new branch] gh/kwen2501/237/orig -> origin/gh/kwen2501/237/orig 2025-12-04T09:43:32.4744652Z * [new branch] gh/kwen2501/238/base -> origin/gh/kwen2501/238/base 2025-12-04T09:43:32.4746356Z * [new branch] gh/kwen2501/238/head -> origin/gh/kwen2501/238/head 2025-12-04T09:43:32.4748999Z * [new branch] gh/kwen2501/238/orig -> origin/gh/kwen2501/238/orig 2025-12-04T09:43:32.4751219Z * [new branch] gh/kwen2501/240/base -> origin/gh/kwen2501/240/base 2025-12-04T09:43:32.4752901Z * [new branch] gh/kwen2501/240/head -> origin/gh/kwen2501/240/head 2025-12-04T09:43:32.4754561Z * [new branch] gh/kwen2501/240/orig -> origin/gh/kwen2501/240/orig 2025-12-04T09:43:32.4757098Z * [new branch] gh/kwen2501/241/base -> origin/gh/kwen2501/241/base 2025-12-04T09:43:32.4758777Z * [new branch] gh/kwen2501/241/head -> origin/gh/kwen2501/241/head 2025-12-04T09:43:32.4760720Z * [new branch] gh/kwen2501/241/orig -> origin/gh/kwen2501/241/orig 2025-12-04T09:43:32.4763104Z * [new branch] gh/kwen2501/247/base -> origin/gh/kwen2501/247/base 2025-12-04T09:43:32.4764810Z * [new branch] gh/kwen2501/247/head -> origin/gh/kwen2501/247/head 2025-12-04T09:43:32.4766528Z * [new branch] gh/kwen2501/247/orig -> origin/gh/kwen2501/247/orig 2025-12-04T09:43:32.4768798Z * [new branch] gh/kwen2501/252/base -> origin/gh/kwen2501/252/base 2025-12-04T09:43:32.4770483Z * [new branch] gh/kwen2501/252/head -> origin/gh/kwen2501/252/head 
2025-12-04T09:43:32.4772189Z * [new branch] gh/kwen2501/252/orig -> origin/gh/kwen2501/252/orig 2025-12-04T09:43:32.4775054Z * [new branch] gh/kwen2501/259/base -> origin/gh/kwen2501/259/base 2025-12-04T09:43:32.4776841Z * [new branch] gh/kwen2501/259/head -> origin/gh/kwen2501/259/head 2025-12-04T09:43:32.4778544Z * [new branch] gh/kwen2501/259/orig -> origin/gh/kwen2501/259/orig 2025-12-04T09:43:32.4781002Z * [new branch] gh/kwen2501/260/base -> origin/gh/kwen2501/260/base 2025-12-04T09:43:32.4782723Z * [new branch] gh/kwen2501/260/head -> origin/gh/kwen2501/260/head 2025-12-04T09:43:32.4784608Z * [new branch] gh/kwen2501/260/orig -> origin/gh/kwen2501/260/orig 2025-12-04T09:43:32.4787010Z * [new branch] gh/kwen2501/268/base -> origin/gh/kwen2501/268/base 2025-12-04T09:43:32.4789011Z * [new branch] gh/kwen2501/268/head -> origin/gh/kwen2501/268/head 2025-12-04T09:43:32.4790681Z * [new branch] gh/kwen2501/268/orig -> origin/gh/kwen2501/268/orig 2025-12-04T09:43:32.4793058Z * [new branch] gh/kwen2501/269/base -> origin/gh/kwen2501/269/base 2025-12-04T09:43:32.4794906Z * [new branch] gh/kwen2501/269/head -> origin/gh/kwen2501/269/head 2025-12-04T09:43:32.4796655Z * [new branch] gh/kwen2501/269/orig -> origin/gh/kwen2501/269/orig 2025-12-04T09:43:32.4799052Z * [new branch] gh/kwen2501/270/base -> origin/gh/kwen2501/270/base 2025-12-04T09:43:32.4800871Z * [new branch] gh/kwen2501/270/head -> origin/gh/kwen2501/270/head 2025-12-04T09:43:32.4802578Z * [new branch] gh/kwen2501/270/orig -> origin/gh/kwen2501/270/orig 2025-12-04T09:43:32.4805012Z * [new branch] gh/kwen2501/271/base -> origin/gh/kwen2501/271/base 2025-12-04T09:43:32.4806815Z * [new branch] gh/kwen2501/271/head -> origin/gh/kwen2501/271/head 2025-12-04T09:43:32.4808530Z * [new branch] gh/kwen2501/271/orig -> origin/gh/kwen2501/271/orig 2025-12-04T09:43:32.4810956Z * [new branch] gh/kwen2501/274/base -> origin/gh/kwen2501/274/base 2025-12-04T09:43:32.4812816Z * [new branch] gh/kwen2501/274/head -> 
origin/gh/kwen2501/274/head 2025-12-04T09:43:32.4814551Z * [new branch] gh/kwen2501/274/orig -> origin/gh/kwen2501/274/orig 2025-12-04T09:43:32.4816958Z * [new branch] gh/kwen2501/275/base -> origin/gh/kwen2501/275/base 2025-12-04T09:43:32.4818779Z * [new branch] gh/kwen2501/275/head -> origin/gh/kwen2501/275/head 2025-12-04T09:43:32.4820664Z * [new branch] gh/kwen2501/275/orig -> origin/gh/kwen2501/275/orig 2025-12-04T09:43:32.4822965Z * [new branch] gh/kwen2501/276/base -> origin/gh/kwen2501/276/base 2025-12-04T09:43:32.4824639Z * [new branch] gh/kwen2501/276/head -> origin/gh/kwen2501/276/head 2025-12-04T09:43:32.4826330Z * [new branch] gh/kwen2501/276/orig -> origin/gh/kwen2501/276/orig 2025-12-04T09:43:32.4828868Z * [new branch] gh/kwen2501/277/base -> origin/gh/kwen2501/277/base 2025-12-04T09:43:32.4830582Z * [new branch] gh/kwen2501/277/head -> origin/gh/kwen2501/277/head 2025-12-04T09:43:32.4832355Z * [new branch] gh/kwen2501/277/orig -> origin/gh/kwen2501/277/orig 2025-12-04T09:43:32.4834811Z * [new branch] gh/kwen2501/278/base -> origin/gh/kwen2501/278/base 2025-12-04T09:43:32.4836559Z * [new branch] gh/kwen2501/278/head -> origin/gh/kwen2501/278/head 2025-12-04T09:43:32.4838270Z * [new branch] gh/kwen2501/278/orig -> origin/gh/kwen2501/278/orig 2025-12-04T09:43:32.4840783Z * [new branch] gh/kwen2501/279/base -> origin/gh/kwen2501/279/base 2025-12-04T09:43:32.4842543Z * [new branch] gh/kwen2501/279/head -> origin/gh/kwen2501/279/head 2025-12-04T09:43:32.4844421Z * [new branch] gh/kwen2501/279/orig -> origin/gh/kwen2501/279/orig 2025-12-04T09:43:32.4846797Z * [new branch] gh/kwen2501/280/base -> origin/gh/kwen2501/280/base 2025-12-04T09:43:32.4848540Z * [new branch] gh/kwen2501/280/head -> origin/gh/kwen2501/280/head 2025-12-04T09:43:32.4850700Z * [new branch] gh/kwen2501/280/orig -> origin/gh/kwen2501/280/orig 2025-12-04T09:43:32.4853138Z * [new branch] gh/kwen2501/281/base -> origin/gh/kwen2501/281/base 2025-12-04T09:43:32.4854853Z * [new branch] 
gh/kwen2501/281/head -> origin/gh/kwen2501/281/head 2025-12-04T09:43:32.4857069Z * [new branch] gh/kwen2501/281/orig -> origin/gh/kwen2501/281/orig 2025-12-04T09:43:32.4859476Z * [new branch] gh/kwen2501/282/base -> origin/gh/kwen2501/282/base 2025-12-04T09:43:32.4861231Z * [new branch] gh/kwen2501/282/head -> origin/gh/kwen2501/282/head 2025-12-04T09:43:32.4862936Z * [new branch] gh/kwen2501/282/orig -> origin/gh/kwen2501/282/orig 2025-12-04T09:43:32.4865358Z * [new branch] gh/kwen2501/283/base -> origin/gh/kwen2501/283/base 2025-12-04T09:43:32.4867138Z * [new branch] gh/kwen2501/283/head -> origin/gh/kwen2501/283/head 2025-12-04T09:43:32.4868986Z * [new branch] gh/kwen2501/283/orig -> origin/gh/kwen2501/283/orig 2025-12-04T09:43:32.4871437Z * [new branch] gh/kwen2501/284/base -> origin/gh/kwen2501/284/base 2025-12-04T09:43:32.4873093Z * [new branch] gh/kwen2501/284/head -> origin/gh/kwen2501/284/head 2025-12-04T09:43:32.4874886Z * [new branch] gh/kwen2501/284/orig -> origin/gh/kwen2501/284/orig 2025-12-04T09:43:32.4877229Z * [new branch] gh/kwen2501/285/base -> origin/gh/kwen2501/285/base 2025-12-04T09:43:32.4878953Z * [new branch] gh/kwen2501/285/head -> origin/gh/kwen2501/285/head 2025-12-04T09:43:32.4880755Z * [new branch] gh/kwen2501/285/orig -> origin/gh/kwen2501/285/orig 2025-12-04T09:43:32.4883210Z * [new branch] gh/kwen2501/286/base -> origin/gh/kwen2501/286/base 2025-12-04T09:43:32.4885400Z * [new branch] gh/kwen2501/286/head -> origin/gh/kwen2501/286/head 2025-12-04T09:43:32.4887078Z * [new branch] gh/kwen2501/286/orig -> origin/gh/kwen2501/286/orig 2025-12-04T09:43:32.4889345Z * [new branch] gh/kwen2501/287/base -> origin/gh/kwen2501/287/base 2025-12-04T09:43:32.4891687Z * [new branch] gh/kwen2501/287/head -> origin/gh/kwen2501/287/head 2025-12-04T09:43:32.4893249Z * [new branch] gh/kwen2501/287/orig -> origin/gh/kwen2501/287/orig 2025-12-04T09:43:32.4895655Z * [new branch] gh/kwen2501/288/base -> origin/gh/kwen2501/288/base 
2025-12-04T09:43:32.4897459Z * [new branch] gh/kwen2501/288/head -> origin/gh/kwen2501/288/head 2025-12-04T09:43:32.4899212Z * [new branch] gh/kwen2501/288/orig -> origin/gh/kwen2501/288/orig 2025-12-04T09:43:32.4901944Z * [new branch] gh/laithsakka/251/base -> origin/gh/laithsakka/251/base 2025-12-04T09:43:32.4903818Z * [new branch] gh/laithsakka/251/head -> origin/gh/laithsakka/251/head 2025-12-04T09:43:32.4905538Z * [new branch] gh/laithsakka/251/orig -> origin/gh/laithsakka/251/orig 2025-12-04T09:43:32.4907967Z * [new branch] gh/laithsakka/276/base -> origin/gh/laithsakka/276/base 2025-12-04T09:43:32.4909624Z * [new branch] gh/laithsakka/276/head -> origin/gh/laithsakka/276/head 2025-12-04T09:43:32.4911314Z * [new branch] gh/laithsakka/276/orig -> origin/gh/laithsakka/276/orig 2025-12-04T09:43:32.4913779Z * [new branch] gh/laithsakka/28/base -> origin/gh/laithsakka/28/base 2025-12-04T09:43:32.4915965Z * [new branch] gh/laithsakka/29/base -> origin/gh/laithsakka/29/base 2025-12-04T09:43:32.4918112Z * [new branch] gh/laithsakka/30/base -> origin/gh/laithsakka/30/base 2025-12-04T09:43:32.4919837Z * [new branch] gh/laithsakka/30/head -> origin/gh/laithsakka/30/head 2025-12-04T09:43:32.4921948Z * [new branch] gh/laithsakka/31/base -> origin/gh/laithsakka/31/base 2025-12-04T09:43:32.4923571Z * [new branch] gh/laithsakka/31/head -> origin/gh/laithsakka/31/head 2025-12-04T09:43:32.4926081Z * [new branch] gh/laithsakka/313/base -> origin/gh/laithsakka/313/base 2025-12-04T09:43:32.4927764Z * [new branch] gh/laithsakka/313/head -> origin/gh/laithsakka/313/head 2025-12-04T09:43:32.4929450Z * [new branch] gh/laithsakka/313/orig -> origin/gh/laithsakka/313/orig 2025-12-04T09:43:32.4931932Z * [new branch] gh/laithsakka/316/base -> origin/gh/laithsakka/316/base 2025-12-04T09:43:32.4933573Z * [new branch] gh/laithsakka/316/head -> origin/gh/laithsakka/316/head 2025-12-04T09:43:32.4935311Z * [new branch] gh/laithsakka/316/orig -> origin/gh/laithsakka/316/orig 
2025-12-04T09:43:32.4938274Z * [new branch] gh/laithsakka/317/base -> origin/gh/laithsakka/317/base 2025-12-04T09:43:32.4940009Z * [new branch] gh/laithsakka/317/head -> origin/gh/laithsakka/317/head 2025-12-04T09:43:32.4952846Z * [new branch] gh/laithsakka/317/orig -> origin/gh/laithsakka/317/orig 2025-12-04T09:43:32.4953450Z * [new branch] gh/laithsakka/319/base -> origin/gh/laithsakka/319/base 2025-12-04T09:43:32.4954265Z * [new branch] gh/laithsakka/319/head -> origin/gh/laithsakka/319/head 2025-12-04T09:43:32.4954785Z * [new branch] gh/laithsakka/319/orig -> origin/gh/laithsakka/319/orig 2025-12-04T09:43:32.4955662Z * [new branch] gh/laithsakka/32/base -> origin/gh/laithsakka/32/base 2025-12-04T09:43:32.4956272Z * [new branch] gh/laithsakka/32/head -> origin/gh/laithsakka/32/head 2025-12-04T09:43:32.4956786Z * [new branch] gh/laithsakka/320/base -> origin/gh/laithsakka/320/base 2025-12-04T09:43:32.4957266Z * [new branch] gh/laithsakka/320/head -> origin/gh/laithsakka/320/head 2025-12-04T09:43:32.4959800Z * [new branch] gh/laithsakka/320/orig -> origin/gh/laithsakka/320/orig 2025-12-04T09:43:32.4961960Z * [new branch] gh/laithsakka/321/base -> origin/gh/laithsakka/321/base 2025-12-04T09:43:32.4963957Z * [new branch] gh/laithsakka/321/head -> origin/gh/laithsakka/321/head 2025-12-04T09:43:32.4965349Z * [new branch] gh/laithsakka/321/orig -> origin/gh/laithsakka/321/orig 2025-12-04T09:43:32.4967918Z * [new branch] gh/laithsakka/322/base -> origin/gh/laithsakka/322/base 2025-12-04T09:43:32.4969709Z * [new branch] gh/laithsakka/322/head -> origin/gh/laithsakka/322/head 2025-12-04T09:43:32.4971493Z * [new branch] gh/laithsakka/322/orig -> origin/gh/laithsakka/322/orig 2025-12-04T09:43:32.4973948Z * [new branch] gh/laithsakka/323/base -> origin/gh/laithsakka/323/base 2025-12-04T09:43:32.4975727Z * [new branch] gh/laithsakka/323/head -> origin/gh/laithsakka/323/head 2025-12-04T09:43:32.4977496Z * [new branch] gh/laithsakka/323/orig -> origin/gh/laithsakka/323/orig 
2025-12-04T09:43:32.4979909Z * [new branch] gh/laithsakka/324/base -> origin/gh/laithsakka/324/base 2025-12-04T09:43:32.4981556Z * [new branch] gh/laithsakka/324/head -> origin/gh/laithsakka/324/head 2025-12-04T09:43:32.4983198Z * [new branch] gh/laithsakka/324/orig -> origin/gh/laithsakka/324/orig 2025-12-04T09:43:32.4985506Z * [new branch] gh/laithsakka/325/base -> origin/gh/laithsakka/325/base 2025-12-04T09:43:32.4987405Z * [new branch] gh/laithsakka/325/head -> origin/gh/laithsakka/325/head 2025-12-04T09:43:32.4989119Z * [new branch] gh/laithsakka/325/orig -> origin/gh/laithsakka/325/orig 2025-12-04T09:43:32.4991641Z * [new branch] gh/laithsakka/326/base -> origin/gh/laithsakka/326/base 2025-12-04T09:43:32.4993423Z * [new branch] gh/laithsakka/326/head -> origin/gh/laithsakka/326/head 2025-12-04T09:43:32.4995222Z * [new branch] gh/laithsakka/326/orig -> origin/gh/laithsakka/326/orig 2025-12-04T09:43:32.4997586Z * [new branch] gh/laithsakka/327/base -> origin/gh/laithsakka/327/base 2025-12-04T09:43:32.4999379Z * [new branch] gh/laithsakka/327/head -> origin/gh/laithsakka/327/head 2025-12-04T09:43:32.5001187Z * [new branch] gh/laithsakka/327/orig -> origin/gh/laithsakka/327/orig 2025-12-04T09:43:32.5003532Z * [new branch] gh/laithsakka/328/base -> origin/gh/laithsakka/328/base 2025-12-04T09:43:32.5005287Z * [new branch] gh/laithsakka/328/head -> origin/gh/laithsakka/328/head 2025-12-04T09:43:32.5006972Z * [new branch] gh/laithsakka/328/orig -> origin/gh/laithsakka/328/orig 2025-12-04T09:43:32.5009741Z * [new branch] gh/liangel/4/base -> origin/gh/liangel/4/base 2025-12-04T09:43:32.5011507Z * [new branch] gh/liangel/4/head -> origin/gh/liangel/4/head 2025-12-04T09:43:32.5013210Z * [new branch] gh/liangel/4/orig -> origin/gh/liangel/4/orig 2025-12-04T09:43:32.5017417Z * [new branch] gh/lucaskabela/1/base -> origin/gh/lucaskabela/1/base 2025-12-04T09:43:32.5019176Z * [new branch] gh/lucaskabela/1/head -> origin/gh/lucaskabela/1/head 2025-12-04T09:43:32.5021916Z * 
[new branch] gh/lw/4/base -> origin/gh/lw/4/base 2025-12-04T09:43:32.5023668Z * [new branch] gh/lw/4/head -> origin/gh/lw/4/head 2025-12-04T09:43:32.5025387Z * [new branch] gh/lw/4/orig -> origin/gh/lw/4/orig 2025-12-04T09:43:32.5027801Z * [new branch] gh/lw/5/base -> origin/gh/lw/5/base 2025-12-04T09:43:32.5029576Z * [new branch] gh/lw/5/head -> origin/gh/lw/5/head 2025-12-04T09:43:32.5031241Z * [new branch] gh/lw/5/orig -> origin/gh/lw/5/orig 2025-12-04T09:43:32.5033501Z * [new branch] gh/lw/6/base -> origin/gh/lw/6/base 2025-12-04T09:43:32.5035447Z * [new branch] gh/lw/6/head -> origin/gh/lw/6/head 2025-12-04T09:43:32.5037053Z * [new branch] gh/lw/6/orig -> origin/gh/lw/6/orig 2025-12-04T09:43:32.5039788Z * [new branch] gh/malfet/14/base -> origin/gh/malfet/14/base 2025-12-04T09:43:32.5042201Z * [new branch] gh/malfet/417/base -> origin/gh/malfet/417/base 2025-12-04T09:43:32.5043992Z * [new branch] gh/malfet/417/head -> origin/gh/malfet/417/head 2025-12-04T09:43:32.5045554Z * [new branch] gh/malfet/417/orig -> origin/gh/malfet/417/orig 2025-12-04T09:43:32.5047933Z * [new branch] gh/malfet/506/base -> origin/gh/malfet/506/base 2025-12-04T09:43:32.5049679Z * [new branch] gh/malfet/506/head -> origin/gh/malfet/506/head 2025-12-04T09:43:32.5051431Z * [new branch] gh/malfet/506/orig -> origin/gh/malfet/506/orig 2025-12-04T09:43:32.5053689Z * [new branch] gh/malfet/517/base -> origin/gh/malfet/517/base 2025-12-04T09:43:32.5055640Z * [new branch] gh/malfet/517/head -> origin/gh/malfet/517/head 2025-12-04T09:43:32.5057950Z * [new branch] gh/malfet/528/base -> origin/gh/malfet/528/base 2025-12-04T09:43:32.5059699Z * [new branch] gh/malfet/528/head -> origin/gh/malfet/528/head 2025-12-04T09:43:32.5061393Z * [new branch] gh/malfet/528/orig -> origin/gh/malfet/528/orig 2025-12-04T09:43:32.5063785Z * [new branch] gh/malfet/537/base -> origin/gh/malfet/537/base 2025-12-04T09:43:32.5065469Z * [new branch] gh/malfet/537/head -> origin/gh/malfet/537/head 
2025-12-04T09:43:32.5067482Z * [new branch] gh/malfet/537/orig -> origin/gh/malfet/537/orig 2025-12-04T09:43:32.5069673Z * [new branch] gh/malfet/546/base -> origin/gh/malfet/546/base 2025-12-04T09:43:32.5071374Z * [new branch] gh/malfet/546/head -> origin/gh/malfet/546/head 2025-12-04T09:43:32.5073052Z * [new branch] gh/malfet/546/orig -> origin/gh/malfet/546/orig 2025-12-04T09:43:32.5075425Z * [new branch] gh/malfet/565/base -> origin/gh/malfet/565/base 2025-12-04T09:43:32.5077161Z * [new branch] gh/malfet/565/head -> origin/gh/malfet/565/head 2025-12-04T09:43:32.5078887Z * [new branch] gh/malfet/565/orig -> origin/gh/malfet/565/orig 2025-12-04T09:43:32.5081156Z * [new branch] gh/malfet/575/base -> origin/gh/malfet/575/base 2025-12-04T09:43:32.5082920Z * [new branch] gh/malfet/575/head -> origin/gh/malfet/575/head 2025-12-04T09:43:32.5084608Z * [new branch] gh/malfet/575/orig -> origin/gh/malfet/575/orig 2025-12-04T09:43:32.5086959Z * [new branch] gh/malfet/580/base -> origin/gh/malfet/580/base 2025-12-04T09:43:32.5088731Z * [new branch] gh/malfet/580/head -> origin/gh/malfet/580/head 2025-12-04T09:43:32.5090531Z * [new branch] gh/malfet/580/orig -> origin/gh/malfet/580/orig 2025-12-04T09:43:32.5092709Z * [new branch] gh/malfet/581/base -> origin/gh/malfet/581/base 2025-12-04T09:43:32.5094436Z * [new branch] gh/malfet/581/head -> origin/gh/malfet/581/head 2025-12-04T09:43:32.5096153Z * [new branch] gh/malfet/581/orig -> origin/gh/malfet/581/orig 2025-12-04T09:43:32.5098374Z * [new branch] gh/malfet/583/base -> origin/gh/malfet/583/base 2025-12-04T09:43:32.5100099Z * [new branch] gh/malfet/583/head -> origin/gh/malfet/583/head 2025-12-04T09:43:32.5101780Z * [new branch] gh/malfet/583/orig -> origin/gh/malfet/583/orig 2025-12-04T09:43:32.5104012Z * [new branch] gh/malfet/586/base -> origin/gh/malfet/586/base 2025-12-04T09:43:32.5105947Z * [new branch] gh/malfet/586/head -> origin/gh/malfet/586/head 2025-12-04T09:43:32.5107519Z * [new branch] gh/malfet/586/orig -> 
origin/gh/malfet/586/orig 2025-12-04T09:43:32.5109886Z * [new branch] gh/malfet/587/base -> origin/gh/malfet/587/base 2025-12-04T09:43:32.5111618Z * [new branch] gh/malfet/587/head -> origin/gh/malfet/587/head 2025-12-04T09:43:32.5113391Z * [new branch] gh/malfet/587/orig -> origin/gh/malfet/587/orig 2025-12-04T09:43:32.5115609Z * [new branch] gh/malfet/588/base -> origin/gh/malfet/588/base 2025-12-04T09:43:32.5117306Z * [new branch] gh/malfet/588/head -> origin/gh/malfet/588/head 2025-12-04T09:43:32.5119110Z * [new branch] gh/malfet/588/orig -> origin/gh/malfet/588/orig 2025-12-04T09:43:32.5121874Z * [new branch] gh/malfet/589/base -> origin/gh/malfet/589/base 2025-12-04T09:43:32.5123592Z * [new branch] gh/malfet/589/head -> origin/gh/malfet/589/head 2025-12-04T09:43:32.5125753Z * [new branch] gh/malfet/589/orig -> origin/gh/malfet/589/orig 2025-12-04T09:43:32.5128048Z * [new branch] gh/malfet/590/base -> origin/gh/malfet/590/base 2025-12-04T09:43:32.5129849Z * [new branch] gh/malfet/590/head -> origin/gh/malfet/590/head 2025-12-04T09:43:32.5131614Z * [new branch] gh/malfet/590/orig -> origin/gh/malfet/590/orig 2025-12-04T09:43:32.5134453Z * [new branch] gh/malfet/591/base -> origin/gh/malfet/591/base 2025-12-04T09:43:32.5136236Z * [new branch] gh/malfet/591/head -> origin/gh/malfet/591/head 2025-12-04T09:43:32.5137974Z * [new branch] gh/malfet/591/orig -> origin/gh/malfet/591/orig 2025-12-04T09:43:32.5140280Z * [new branch] gh/malfet/592/base -> origin/gh/malfet/592/base 2025-12-04T09:43:32.5142025Z * [new branch] gh/malfet/592/head -> origin/gh/malfet/592/head 2025-12-04T09:43:32.5143900Z * [new branch] gh/malfet/592/orig -> origin/gh/malfet/592/orig 2025-12-04T09:43:32.5146198Z * [new branch] gh/malfet/593/base -> origin/gh/malfet/593/base 2025-12-04T09:43:32.5148006Z * [new branch] gh/malfet/593/head -> origin/gh/malfet/593/head 2025-12-04T09:43:32.5149737Z * [new branch] gh/malfet/593/orig -> origin/gh/malfet/593/orig 2025-12-04T09:43:32.5152154Z * [new 
branch] gh/malfet/594/base -> origin/gh/malfet/594/base 2025-12-04T09:43:32.5153878Z * [new branch] gh/malfet/594/head -> origin/gh/malfet/594/head 2025-12-04T09:43:32.5155578Z * [new branch] gh/malfet/594/orig -> origin/gh/malfet/594/orig 2025-12-04T09:43:32.5158096Z * [new branch] gh/malfet/595/base -> origin/gh/malfet/595/base 2025-12-04T09:43:32.5159897Z * [new branch] gh/malfet/595/head -> origin/gh/malfet/595/head 2025-12-04T09:43:32.5161575Z * [new branch] gh/malfet/595/orig -> origin/gh/malfet/595/orig 2025-12-04T09:43:32.5163906Z * [new branch] gh/malfet/596/base -> origin/gh/malfet/596/base 2025-12-04T09:43:32.5165638Z * [new branch] gh/malfet/596/head -> origin/gh/malfet/596/head 2025-12-04T09:43:32.5167396Z * [new branch] gh/malfet/596/orig -> origin/gh/malfet/596/orig 2025-12-04T09:43:32.5169680Z * [new branch] gh/malfet/597/base -> origin/gh/malfet/597/base 2025-12-04T09:43:32.5171475Z * [new branch] gh/malfet/597/head -> origin/gh/malfet/597/head 2025-12-04T09:43:32.5173678Z * [new branch] gh/malfet/597/orig -> origin/gh/malfet/597/orig 2025-12-04T09:43:32.5175905Z * [new branch] gh/malfet/598/base -> origin/gh/malfet/598/base 2025-12-04T09:43:32.5177775Z * [new branch] gh/malfet/598/head -> origin/gh/malfet/598/head 2025-12-04T09:43:32.5179206Z * [new branch] gh/malfet/598/orig -> origin/gh/malfet/598/orig 2025-12-04T09:43:32.5181683Z * [new branch] gh/malfet/599/base -> origin/gh/malfet/599/base 2025-12-04T09:43:32.5183362Z * [new branch] gh/malfet/599/head -> origin/gh/malfet/599/head 2025-12-04T09:43:32.5185104Z * [new branch] gh/malfet/599/orig -> origin/gh/malfet/599/orig 2025-12-04T09:43:32.5187504Z * [new branch] gh/malfet/600/base -> origin/gh/malfet/600/base 2025-12-04T09:43:32.5189201Z * [new branch] gh/malfet/600/head -> origin/gh/malfet/600/head 2025-12-04T09:43:32.5190977Z * [new branch] gh/malfet/600/orig -> origin/gh/malfet/600/orig 2025-12-04T09:43:32.5193442Z * [new branch] gh/malfet/601/base -> origin/gh/malfet/601/base 
2025-12-04T09:43:32.5195160Z * [new branch] gh/malfet/601/head -> origin/gh/malfet/601/head 2025-12-04T09:43:32.5197296Z * [new branch] gh/malfet/601/orig -> origin/gh/malfet/601/orig 2025-12-04T09:43:32.5199809Z * [new branch] gh/malfet/602/base -> origin/gh/malfet/602/base 2025-12-04T09:43:32.5201471Z * [new branch] gh/malfet/602/head -> origin/gh/malfet/602/head 2025-12-04T09:43:32.5203186Z * [new branch] gh/malfet/602/orig -> origin/gh/malfet/602/orig 2025-12-04T09:43:32.5205638Z * [new branch] gh/malfet/603/base -> origin/gh/malfet/603/base 2025-12-04T09:43:32.5207239Z * [new branch] gh/malfet/603/head -> origin/gh/malfet/603/head 2025-12-04T09:43:32.5209012Z * [new branch] gh/malfet/603/orig -> origin/gh/malfet/603/orig 2025-12-04T09:43:32.5211382Z * [new branch] gh/malfet/604/base -> origin/gh/malfet/604/base 2025-12-04T09:43:32.5213085Z * [new branch] gh/malfet/604/head -> origin/gh/malfet/604/head 2025-12-04T09:43:32.5214778Z * [new branch] gh/malfet/604/orig -> origin/gh/malfet/604/orig 2025-12-04T09:43:32.5217147Z * [new branch] gh/malfet/605/base -> origin/gh/malfet/605/base 2025-12-04T09:43:32.5218844Z * [new branch] gh/malfet/605/head -> origin/gh/malfet/605/head 2025-12-04T09:43:32.5220609Z * [new branch] gh/malfet/605/orig -> origin/gh/malfet/605/orig 2025-12-04T09:43:32.5223046Z * [new branch] gh/malfet/606/base -> origin/gh/malfet/606/base 2025-12-04T09:43:32.5224797Z * [new branch] gh/malfet/606/head -> origin/gh/malfet/606/head 2025-12-04T09:43:32.5226488Z * [new branch] gh/malfet/606/orig -> origin/gh/malfet/606/orig 2025-12-04T09:43:32.5229098Z * [new branch] gh/malfet/607/base -> origin/gh/malfet/607/base 2025-12-04T09:43:32.5230816Z * [new branch] gh/malfet/607/head -> origin/gh/malfet/607/head 2025-12-04T09:43:32.5232568Z * [new branch] gh/malfet/607/orig -> origin/gh/malfet/607/orig 2025-12-04T09:43:32.5234956Z * [new branch] gh/malfet/608/base -> origin/gh/malfet/608/base 2025-12-04T09:43:32.5236689Z * [new branch] gh/malfet/608/head -> 
origin/gh/malfet/608/head 2025-12-04T09:43:32.5238392Z * [new branch] gh/malfet/608/orig -> origin/gh/malfet/608/orig 2025-12-04T09:43:32.5240751Z * [new branch] gh/malfet/609/base -> origin/gh/malfet/609/base 2025-12-04T09:43:32.5242480Z * [new branch] gh/malfet/609/head -> origin/gh/malfet/609/head 2025-12-04T09:43:32.5244177Z * [new branch] gh/malfet/609/orig -> origin/gh/malfet/609/orig 2025-12-04T09:43:32.5246774Z * [new branch] gh/malfet/610/base -> origin/gh/malfet/610/base 2025-12-04T09:43:32.5248459Z * [new branch] gh/malfet/610/head -> origin/gh/malfet/610/head 2025-12-04T09:43:32.5250174Z * [new branch] gh/malfet/610/orig -> origin/gh/malfet/610/orig 2025-12-04T09:43:32.5252585Z * [new branch] gh/malfet/611/base -> origin/gh/malfet/611/base 2025-12-04T09:43:32.5254269Z * [new branch] gh/malfet/611/head -> origin/gh/malfet/611/head 2025-12-04T09:43:32.5257975Z * [new branch] gh/malfet/611/orig -> origin/gh/malfet/611/orig 2025-12-04T09:43:32.5260318Z * [new branch] gh/malfet/612/base -> origin/gh/malfet/612/base 2025-12-04T09:43:32.5262080Z * [new branch] gh/malfet/612/head -> origin/gh/malfet/612/head 2025-12-04T09:43:32.5263863Z * [new branch] gh/malfet/612/orig -> origin/gh/malfet/612/orig 2025-12-04T09:43:32.5266239Z * [new branch] gh/malfet/64/base -> origin/gh/malfet/64/base 2025-12-04T09:43:32.5268661Z * [new branch] gh/malfet/64/head -> origin/gh/malfet/64/head 2025-12-04T09:43:32.5271472Z * [new branch] gh/manuelcandales/11/base -> origin/gh/manuelcandales/11/base 2025-12-04T09:43:32.5273313Z * [new branch] gh/manuelcandales/11/head -> origin/gh/manuelcandales/11/head 2025-12-04T09:43:32.5274997Z * [new branch] gh/manuelcandales/11/orig -> origin/gh/manuelcandales/11/orig 2025-12-04T09:43:32.5278062Z * [new branch] gh/markkm/1/base -> origin/gh/markkm/1/base 2025-12-04T09:43:32.5280816Z * [new branch] gh/masnesral/1/base -> origin/gh/masnesral/1/base 2025-12-04T09:43:32.5282557Z * [new branch] gh/masnesral/1/head -> origin/gh/masnesral/1/head 
2025-12-04T09:43:32.5284258Z * [new branch] gh/masnesral/1/orig -> origin/gh/masnesral/1/orig 2025-12-04T09:43:32.5287159Z * [new branch] gh/mhorowitz/0/base -> origin/gh/mhorowitz/0/base 2025-12-04T09:43:32.5288956Z * [new branch] gh/mhorowitz/0/head -> origin/gh/mhorowitz/0/head 2025-12-04T09:43:32.5291202Z * [new branch] gh/mhorowitz/1/base -> origin/gh/mhorowitz/1/base 2025-12-04T09:43:32.5292922Z * [new branch] gh/mhorowitz/1/head -> origin/gh/mhorowitz/1/head 2025-12-04T09:43:32.5295102Z * [new branch] gh/mhorowitz/2/base -> origin/gh/mhorowitz/2/base 2025-12-04T09:43:32.5296924Z * [new branch] gh/mhorowitz/2/head -> origin/gh/mhorowitz/2/head 2025-12-04T09:43:32.5299098Z * [new branch] gh/mhorowitz/3/base -> origin/gh/mhorowitz/3/base 2025-12-04T09:43:32.5300719Z * [new branch] gh/mhorowitz/3/head -> origin/gh/mhorowitz/3/head 2025-12-04T09:43:32.5302828Z * [new branch] gh/mhorowitz/4/base -> origin/gh/mhorowitz/4/base 2025-12-04T09:43:32.5304531Z * [new branch] gh/mhorowitz/4/head -> origin/gh/mhorowitz/4/head 2025-12-04T09:43:32.5306647Z * [new branch] gh/mhorowitz/5/base -> origin/gh/mhorowitz/5/base 2025-12-04T09:43:32.5308480Z * [new branch] gh/mhorowitz/5/head -> origin/gh/mhorowitz/5/head 2025-12-04T09:43:32.5310640Z * [new branch] gh/mhorowitz/6/base -> origin/gh/mhorowitz/6/base 2025-12-04T09:43:32.5312313Z * [new branch] gh/mhorowitz/6/head -> origin/gh/mhorowitz/6/head 2025-12-04T09:43:32.5315219Z * [new branch] gh/mikaylagawarecki/234/base -> origin/gh/mikaylagawarecki/234/base 2025-12-04T09:43:32.5317021Z * [new branch] gh/mikaylagawarecki/234/head -> origin/gh/mikaylagawarecki/234/head 2025-12-04T09:43:32.5319300Z * [new branch] gh/mikaylagawarecki/235/base -> origin/gh/mikaylagawarecki/235/base 2025-12-04T09:43:32.5321191Z * [new branch] gh/mikaylagawarecki/235/head -> origin/gh/mikaylagawarecki/235/head 2025-12-04T09:43:32.5323292Z * [new branch] gh/mikaylagawarecki/236/base -> origin/gh/mikaylagawarecki/236/base 2025-12-04T09:43:32.5325000Z 
* [new branch] gh/mikaylagawarecki/236/head -> origin/gh/mikaylagawarecki/236/head 2025-12-04T09:43:32.5327240Z * [new branch] gh/mikaylagawarecki/237/base -> origin/gh/mikaylagawarecki/237/base 2025-12-04T09:43:32.5328861Z * [new branch] gh/mikaylagawarecki/237/head -> origin/gh/mikaylagawarecki/237/head 2025-12-04T09:43:32.5331143Z * [new branch] gh/mikaylagawarecki/238/base -> origin/gh/mikaylagawarecki/238/base 2025-12-04T09:43:32.5332798Z * [new branch] gh/mikaylagawarecki/238/head -> origin/gh/mikaylagawarecki/238/head 2025-12-04T09:43:32.5335149Z * [new branch] gh/mikaylagawarecki/336/base -> origin/gh/mikaylagawarecki/336/base 2025-12-04T09:43:32.5336865Z * [new branch] gh/mikaylagawarecki/336/head -> origin/gh/mikaylagawarecki/336/head 2025-12-04T09:43:32.5338635Z * [new branch] gh/mikaylagawarecki/336/orig -> origin/gh/mikaylagawarecki/336/orig 2025-12-04T09:43:32.5341045Z * [new branch] gh/mikaylagawarecki/341/base -> origin/gh/mikaylagawarecki/341/base 2025-12-04T09:43:32.5342707Z * [new branch] gh/mikaylagawarecki/341/head -> origin/gh/mikaylagawarecki/341/head 2025-12-04T09:43:32.5344485Z * [new branch] gh/mikaylagawarecki/341/orig -> origin/gh/mikaylagawarecki/341/orig 2025-12-04T09:43:32.5346885Z * [new branch] gh/mikaylagawarecki/342/base -> origin/gh/mikaylagawarecki/342/base 2025-12-04T09:43:32.5348798Z * [new branch] gh/mikaylagawarecki/342/head -> origin/gh/mikaylagawarecki/342/head 2025-12-04T09:43:32.5350528Z * [new branch] gh/mikaylagawarecki/342/orig -> origin/gh/mikaylagawarecki/342/orig 2025-12-04T09:43:32.5352869Z * [new branch] gh/mikaylagawarecki/345/base -> origin/gh/mikaylagawarecki/345/base 2025-12-04T09:43:32.5354487Z * [new branch] gh/mikaylagawarecki/345/head -> origin/gh/mikaylagawarecki/345/head 2025-12-04T09:43:32.5358138Z * [new branch] gh/mikaylagawarecki/345/orig -> origin/gh/mikaylagawarecki/345/orig 2025-12-04T09:43:32.5361024Z * [new branch] gh/mikaylagawarecki/346/base -> origin/gh/mikaylagawarecki/346/base 
2025-12-04T09:43:32.5362761Z * [new branch] gh/mikaylagawarecki/346/head -> origin/gh/mikaylagawarecki/346/head 2025-12-04T09:43:32.5364595Z * [new branch] gh/mikaylagawarecki/346/orig -> origin/gh/mikaylagawarecki/346/orig 2025-12-04T09:43:32.5367028Z * [new branch] gh/mikaylagawarecki/347/base -> origin/gh/mikaylagawarecki/347/base 2025-12-04T09:43:32.5368621Z * [new branch] gh/mikaylagawarecki/347/head -> origin/gh/mikaylagawarecki/347/head 2025-12-04T09:43:32.5370347Z * [new branch] gh/mikaylagawarecki/347/orig -> origin/gh/mikaylagawarecki/347/orig 2025-12-04T09:43:32.5372750Z * [new branch] gh/mikaylagawarecki/350/base -> origin/gh/mikaylagawarecki/350/base 2025-12-04T09:43:32.5374452Z * [new branch] gh/mikaylagawarecki/350/head -> origin/gh/mikaylagawarecki/350/head 2025-12-04T09:43:32.5376217Z * [new branch] gh/mikaylagawarecki/350/orig -> origin/gh/mikaylagawarecki/350/orig 2025-12-04T09:43:32.5378847Z * [new branch] gh/mikaylagawarecki/351/base -> origin/gh/mikaylagawarecki/351/base 2025-12-04T09:43:32.5380625Z * [new branch] gh/mikaylagawarecki/351/head -> origin/gh/mikaylagawarecki/351/head 2025-12-04T09:43:32.5382348Z * [new branch] gh/mikaylagawarecki/351/orig -> origin/gh/mikaylagawarecki/351/orig 2025-12-04T09:43:32.5384822Z * [new branch] gh/mikaylagawarecki/352/base -> origin/gh/mikaylagawarecki/352/base 2025-12-04T09:43:32.5386631Z * [new branch] gh/mikaylagawarecki/352/head -> origin/gh/mikaylagawarecki/352/head 2025-12-04T09:43:32.5388691Z * [new branch] gh/mikaylagawarecki/352/orig -> origin/gh/mikaylagawarecki/352/orig 2025-12-04T09:43:32.5391446Z * [new branch] gh/mikaylagawarecki/353/base -> origin/gh/mikaylagawarecki/353/base 2025-12-04T09:43:32.5393421Z * [new branch] gh/mikaylagawarecki/353/head -> origin/gh/mikaylagawarecki/353/head 2025-12-04T09:43:32.5395083Z * [new branch] gh/mikaylagawarecki/353/orig -> origin/gh/mikaylagawarecki/353/orig 2025-12-04T09:43:32.5397274Z * [new branch] gh/mikaylagawarecki/354/base -> 
origin/gh/mikaylagawarecki/354/base 2025-12-04T09:43:32.5398991Z * [new branch] gh/mikaylagawarecki/354/head -> origin/gh/mikaylagawarecki/354/head 2025-12-04T09:43:32.5400719Z * [new branch] gh/mikaylagawarecki/354/orig -> origin/gh/mikaylagawarecki/354/orig 2025-12-04T09:43:32.5403588Z * [new branch] gh/mikaylagawarecki/356/base -> origin/gh/mikaylagawarecki/356/base 2025-12-04T09:43:32.5405340Z * [new branch] gh/mikaylagawarecki/356/head -> origin/gh/mikaylagawarecki/356/head 2025-12-04T09:43:32.5407527Z * [new branch] gh/mikaylagawarecki/356/orig -> origin/gh/mikaylagawarecki/356/orig 2025-12-04T09:43:32.5409807Z * [new branch] gh/mikaylagawarecki/357/base -> origin/gh/mikaylagawarecki/357/base 2025-12-04T09:43:32.5411495Z * [new branch] gh/mikaylagawarecki/357/head -> origin/gh/mikaylagawarecki/357/head 2025-12-04T09:43:32.5413312Z * [new branch] gh/mikaylagawarecki/357/orig -> origin/gh/mikaylagawarecki/357/orig 2025-12-04T09:43:32.5416259Z * [new branch] gh/mikaylagawarecki/359/base -> origin/gh/mikaylagawarecki/359/base 2025-12-04T09:43:32.5418155Z * [new branch] gh/mikaylagawarecki/359/head -> origin/gh/mikaylagawarecki/359/head 2025-12-04T09:43:32.5419882Z * [new branch] gh/mikaylagawarecki/359/orig -> origin/gh/mikaylagawarecki/359/orig 2025-12-04T09:43:32.5422235Z * [new branch] gh/mikaylagawarecki/360/base -> origin/gh/mikaylagawarecki/360/base 2025-12-04T09:43:32.5424049Z * [new branch] gh/mikaylagawarecki/360/head -> origin/gh/mikaylagawarecki/360/head 2025-12-04T09:43:32.5425763Z * [new branch] gh/mikaylagawarecki/360/orig -> origin/gh/mikaylagawarecki/360/orig 2025-12-04T09:43:32.5428419Z * [new branch] gh/mikaylagawarecki/361/base -> origin/gh/mikaylagawarecki/361/base 2025-12-04T09:43:32.5430128Z * [new branch] gh/mikaylagawarecki/361/head -> origin/gh/mikaylagawarecki/361/head 2025-12-04T09:43:32.5431835Z * [new branch] gh/mikaylagawarecki/361/orig -> origin/gh/mikaylagawarecki/361/orig 2025-12-04T09:43:32.5434289Z * [new branch] 
gh/mikaylagawarecki/362/base -> origin/gh/mikaylagawarecki/362/base 2025-12-04T09:43:32.5436193Z * [new branch] gh/mikaylagawarecki/362/head -> origin/gh/mikaylagawarecki/362/head 2025-12-04T09:43:32.5437953Z * [new branch] gh/mikaylagawarecki/362/orig -> origin/gh/mikaylagawarecki/362/orig 2025-12-04T09:43:32.5440582Z * [new branch] gh/mikaylagawarecki/363/base -> origin/gh/mikaylagawarecki/363/base 2025-12-04T09:43:32.5442558Z * [new branch] gh/mikaylagawarecki/363/head -> origin/gh/mikaylagawarecki/363/head 2025-12-04T09:43:32.5444247Z * [new branch] gh/mikaylagawarecki/363/orig -> origin/gh/mikaylagawarecki/363/orig 2025-12-04T09:43:32.5447012Z * [new branch] gh/mikaylagawarecki/364/base -> origin/gh/mikaylagawarecki/364/base 2025-12-04T09:43:32.5448814Z * [new branch] gh/mikaylagawarecki/364/head -> origin/gh/mikaylagawarecki/364/head 2025-12-04T09:43:32.5450527Z * [new branch] gh/mikaylagawarecki/364/orig -> origin/gh/mikaylagawarecki/364/orig 2025-12-04T09:43:32.5452994Z * [new branch] gh/mikaylagawarecki/365/base -> origin/gh/mikaylagawarecki/365/base 2025-12-04T09:43:32.5454831Z * [new branch] gh/mikaylagawarecki/365/head -> origin/gh/mikaylagawarecki/365/head 2025-12-04T09:43:32.5456845Z * [new branch] gh/mikaylagawarecki/365/orig -> origin/gh/mikaylagawarecki/365/orig 2025-12-04T09:43:32.5459168Z * [new branch] gh/mikaylagawarecki/366/base -> origin/gh/mikaylagawarecki/366/base 2025-12-04T09:43:32.5460832Z * [new branch] gh/mikaylagawarecki/366/head -> origin/gh/mikaylagawarecki/366/head 2025-12-04T09:43:32.5462696Z * [new branch] gh/mikaylagawarecki/366/orig -> origin/gh/mikaylagawarecki/366/orig 2025-12-04T09:43:32.5465119Z * [new branch] gh/mikaylagawarecki/367/base -> origin/gh/mikaylagawarecki/367/base 2025-12-04T09:43:32.5466818Z * [new branch] gh/mikaylagawarecki/367/head -> origin/gh/mikaylagawarecki/367/head 2025-12-04T09:43:32.5468710Z * [new branch] gh/mikaylagawarecki/367/orig -> origin/gh/mikaylagawarecki/367/orig 
2025-12-04T09:43:32.5471173Z * [new branch] gh/mikaylagawarecki/368/base -> origin/gh/mikaylagawarecki/368/base 2025-12-04T09:43:32.5472883Z * [new branch] gh/mikaylagawarecki/368/head -> origin/gh/mikaylagawarecki/368/head 2025-12-04T09:43:32.5474573Z * [new branch] gh/mikaylagawarecki/368/orig -> origin/gh/mikaylagawarecki/368/orig 2025-12-04T09:43:32.5476981Z * [new branch] gh/mikaylagawarecki/369/base -> origin/gh/mikaylagawarecki/369/base 2025-12-04T09:43:32.5478838Z * [new branch] gh/mikaylagawarecki/369/head -> origin/gh/mikaylagawarecki/369/head 2025-12-04T09:43:32.5480688Z * [new branch] gh/mikaylagawarecki/369/orig -> origin/gh/mikaylagawarecki/369/orig 2025-12-04T09:43:32.5483343Z * [new branch] gh/mikaylagawarecki/370/base -> origin/gh/mikaylagawarecki/370/base 2025-12-04T09:43:32.5485051Z * [new branch] gh/mikaylagawarecki/370/head -> origin/gh/mikaylagawarecki/370/head 2025-12-04T09:43:32.5486859Z * [new branch] gh/mikaylagawarecki/370/orig -> origin/gh/mikaylagawarecki/370/orig 2025-12-04T09:43:32.5489319Z * [new branch] gh/mikaylagawarecki/371/base -> origin/gh/mikaylagawarecki/371/base 2025-12-04T09:43:32.5490966Z * [new branch] gh/mikaylagawarecki/371/head -> origin/gh/mikaylagawarecki/371/head 2025-12-04T09:43:32.5492654Z * [new branch] gh/mikaylagawarecki/371/orig -> origin/gh/mikaylagawarecki/371/orig 2025-12-04T09:43:32.5495159Z * [new branch] gh/mikaylagawarecki/372/base -> origin/gh/mikaylagawarecki/372/base 2025-12-04T09:43:32.5496899Z * [new branch] gh/mikaylagawarecki/372/head -> origin/gh/mikaylagawarecki/372/head 2025-12-04T09:43:32.5498625Z * [new branch] gh/mikaylagawarecki/372/orig -> origin/gh/mikaylagawarecki/372/orig 2025-12-04T09:43:32.5501019Z * [new branch] gh/mikaylagawarecki/373/base -> origin/gh/mikaylagawarecki/373/base 2025-12-04T09:43:32.5502806Z * [new branch] gh/mikaylagawarecki/373/head -> origin/gh/mikaylagawarecki/373/head 2025-12-04T09:43:32.5504475Z * [new branch] gh/mikaylagawarecki/373/orig -> 
origin/gh/mikaylagawarecki/373/orig 2025-12-04T09:43:32.5506822Z * [new branch] gh/mikaylagawarecki/374/base -> origin/gh/mikaylagawarecki/374/base 2025-12-04T09:43:32.5508653Z * [new branch] gh/mikaylagawarecki/374/head -> origin/gh/mikaylagawarecki/374/head 2025-12-04T09:43:32.5510464Z * [new branch] gh/mikaylagawarecki/374/orig -> origin/gh/mikaylagawarecki/374/orig 2025-12-04T09:43:32.5512755Z * [new branch] gh/mikaylagawarecki/375/base -> origin/gh/mikaylagawarecki/375/base 2025-12-04T09:43:32.5514519Z * [new branch] gh/mikaylagawarecki/375/head -> origin/gh/mikaylagawarecki/375/head 2025-12-04T09:43:32.5516244Z * [new branch] gh/mikaylagawarecki/375/orig -> origin/gh/mikaylagawarecki/375/orig 2025-12-04T09:43:32.5518663Z * [new branch] gh/mikaylagawarecki/376/base -> origin/gh/mikaylagawarecki/376/base 2025-12-04T09:43:32.5520566Z * [new branch] gh/mikaylagawarecki/376/head -> origin/gh/mikaylagawarecki/376/head 2025-12-04T09:43:32.5522647Z * [new branch] gh/mikaylagawarecki/376/orig -> origin/gh/mikaylagawarecki/376/orig 2025-12-04T09:43:32.5525116Z * [new branch] gh/mikaylagawarecki/377/base -> origin/gh/mikaylagawarecki/377/base 2025-12-04T09:43:32.5526904Z * [new branch] gh/mikaylagawarecki/377/head -> origin/gh/mikaylagawarecki/377/head 2025-12-04T09:43:32.5528652Z * [new branch] gh/mikaylagawarecki/377/orig -> origin/gh/mikaylagawarecki/377/orig 2025-12-04T09:43:32.5531116Z * [new branch] gh/mikaylagawarecki/378/base -> origin/gh/mikaylagawarecki/378/base 2025-12-04T09:43:32.5532873Z * [new branch] gh/mikaylagawarecki/378/head -> origin/gh/mikaylagawarecki/378/head 2025-12-04T09:43:32.5534692Z * [new branch] gh/mikaylagawarecki/378/orig -> origin/gh/mikaylagawarecki/378/orig 2025-12-04T09:43:32.5537089Z * [new branch] gh/mikaylagawarecki/379/base -> origin/gh/mikaylagawarecki/379/base 2025-12-04T09:43:32.5538829Z * [new branch] gh/mikaylagawarecki/379/head -> origin/gh/mikaylagawarecki/379/head 2025-12-04T09:43:32.5540586Z * [new branch] 
gh/mikaylagawarecki/379/orig -> origin/gh/mikaylagawarecki/379/orig 2025-12-04T09:43:32.5542852Z * [new branch] gh/mikaylagawarecki/380/base -> origin/gh/mikaylagawarecki/380/base 2025-12-04T09:43:32.5544554Z * [new branch] gh/mikaylagawarecki/380/head -> origin/gh/mikaylagawarecki/380/head 2025-12-04T09:43:32.5546235Z * [new branch] gh/mikaylagawarecki/380/orig -> origin/gh/mikaylagawarecki/380/orig 2025-12-04T09:43:32.5548681Z * [new branch] gh/mikaylagawarecki/381/base -> origin/gh/mikaylagawarecki/381/base 2025-12-04T09:43:32.5550858Z * [new branch] gh/mikaylagawarecki/381/head -> origin/gh/mikaylagawarecki/381/head 2025-12-04T09:43:32.5552571Z * [new branch] gh/mikaylagawarecki/381/orig -> origin/gh/mikaylagawarecki/381/orig 2025-12-04T09:43:32.5554819Z * [new branch] gh/mikaylagawarecki/382/base -> origin/gh/mikaylagawarecki/382/base 2025-12-04T09:43:32.5556735Z * [new branch] gh/mikaylagawarecki/382/head -> origin/gh/mikaylagawarecki/382/head 2025-12-04T09:43:32.5558513Z * [new branch] gh/mikaylagawarecki/382/orig -> origin/gh/mikaylagawarecki/382/orig 2025-12-04T09:43:32.5561022Z * [new branch] gh/mikaylagawarecki/383/base -> origin/gh/mikaylagawarecki/383/base 2025-12-04T09:43:32.5562849Z * [new branch] gh/mikaylagawarecki/383/head -> origin/gh/mikaylagawarecki/383/head 2025-12-04T09:43:32.5564560Z * [new branch] gh/mikaylagawarecki/383/orig -> origin/gh/mikaylagawarecki/383/orig 2025-12-04T09:43:32.5566892Z * [new branch] gh/mikaylagawarecki/384/base -> origin/gh/mikaylagawarecki/384/base 2025-12-04T09:43:32.5568646Z * [new branch] gh/mikaylagawarecki/384/head -> origin/gh/mikaylagawarecki/384/head 2025-12-04T09:43:32.5570353Z * [new branch] gh/mikaylagawarecki/384/orig -> origin/gh/mikaylagawarecki/384/orig 2025-12-04T09:43:32.5572770Z * [new branch] gh/mikaylagawarecki/385/base -> origin/gh/mikaylagawarecki/385/base 2025-12-04T09:43:32.5574513Z * [new branch] gh/mikaylagawarecki/385/head -> origin/gh/mikaylagawarecki/385/head 
2025-12-04T09:43:32.5576284Z * [new branch] gh/mikaylagawarecki/385/orig -> origin/gh/mikaylagawarecki/385/orig 2025-12-04T09:43:32.5578705Z * [new branch] gh/mikaylagawarecki/386/base -> origin/gh/mikaylagawarecki/386/base 2025-12-04T09:43:32.5580393Z * [new branch] gh/mikaylagawarecki/386/head -> origin/gh/mikaylagawarecki/386/head 2025-12-04T09:43:32.5582197Z * [new branch] gh/mikaylagawarecki/386/orig -> origin/gh/mikaylagawarecki/386/orig 2025-12-04T09:43:32.5584767Z * [new branch] gh/mikaylagawarecki/387/base -> origin/gh/mikaylagawarecki/387/base 2025-12-04T09:43:32.5586355Z * [new branch] gh/mikaylagawarecki/387/head -> origin/gh/mikaylagawarecki/387/head 2025-12-04T09:43:32.5588136Z * [new branch] gh/mikaylagawarecki/387/orig -> origin/gh/mikaylagawarecki/387/orig 2025-12-04T09:43:32.5590372Z * [new branch] gh/mikaylagawarecki/388/base -> origin/gh/mikaylagawarecki/388/base 2025-12-04T09:43:32.5592070Z * [new branch] gh/mikaylagawarecki/388/head -> origin/gh/mikaylagawarecki/388/head 2025-12-04T09:43:32.5593801Z * [new branch] gh/mikaylagawarecki/388/orig -> origin/gh/mikaylagawarecki/388/orig 2025-12-04T09:43:32.5596636Z * [new branch] gh/mikaylagawarecki/389/base -> origin/gh/mikaylagawarecki/389/base 2025-12-04T09:43:32.5598369Z * [new branch] gh/mikaylagawarecki/389/head -> origin/gh/mikaylagawarecki/389/head 2025-12-04T09:43:32.5600304Z * [new branch] gh/mikaylagawarecki/389/orig -> origin/gh/mikaylagawarecki/389/orig 2025-12-04T09:43:32.5602862Z * [new branch] gh/mikaylagawarecki/390/base -> origin/gh/mikaylagawarecki/390/base 2025-12-04T09:43:32.5604566Z * [new branch] gh/mikaylagawarecki/390/head -> origin/gh/mikaylagawarecki/390/head 2025-12-04T09:43:32.5606302Z * [new branch] gh/mikaylagawarecki/390/orig -> origin/gh/mikaylagawarecki/390/orig 2025-12-04T09:43:32.5608834Z * [new branch] gh/mikaylagawarecki/391/base -> origin/gh/mikaylagawarecki/391/base 2025-12-04T09:43:32.5610569Z * [new branch] gh/mikaylagawarecki/391/head -> 
origin/gh/mikaylagawarecki/391/head 2025-12-04T09:43:32.5612352Z * [new branch] gh/mikaylagawarecki/391/orig -> origin/gh/mikaylagawarecki/391/orig 2025-12-04T09:43:32.5614853Z * [new branch] gh/mikaylagawarecki/392/base -> origin/gh/mikaylagawarecki/392/base 2025-12-04T09:43:32.5616599Z * [new branch] gh/mikaylagawarecki/392/head -> origin/gh/mikaylagawarecki/392/head 2025-12-04T09:43:32.5618315Z * [new branch] gh/mikaylagawarecki/392/orig -> origin/gh/mikaylagawarecki/392/orig 2025-12-04T09:43:32.5621090Z * [new branch] gh/mlazos/41/base -> origin/gh/mlazos/41/base 2025-12-04T09:43:32.5622801Z * [new branch] gh/mlazos/41/head -> origin/gh/mlazos/41/head 2025-12-04T09:43:32.5624504Z * [new branch] gh/mlazos/41/orig -> origin/gh/mlazos/41/orig 2025-12-04T09:43:32.5626861Z * [new branch] gh/mlazos/42/base -> origin/gh/mlazos/42/base 2025-12-04T09:43:32.5628686Z * [new branch] gh/mlazos/42/head -> origin/gh/mlazos/42/head 2025-12-04T09:43:32.5630364Z * [new branch] gh/mlazos/42/orig -> origin/gh/mlazos/42/orig 2025-12-04T09:43:32.5632525Z * [new branch] gh/mlazos/43/base -> origin/gh/mlazos/43/base 2025-12-04T09:43:32.5634322Z * [new branch] gh/mlazos/43/head -> origin/gh/mlazos/43/head 2025-12-04T09:43:32.5635989Z * [new branch] gh/mlazos/43/orig -> origin/gh/mlazos/43/orig 2025-12-04T09:43:32.5638250Z * [new branch] gh/mlazos/44/base -> origin/gh/mlazos/44/base 2025-12-04T09:43:32.5639951Z * [new branch] gh/mlazos/44/head -> origin/gh/mlazos/44/head 2025-12-04T09:43:32.5641714Z * [new branch] gh/mlazos/44/orig -> origin/gh/mlazos/44/orig 2025-12-04T09:43:32.5643987Z * [new branch] gh/mlazos/47/base -> origin/gh/mlazos/47/base 2025-12-04T09:43:32.5645723Z * [new branch] gh/mlazos/47/head -> origin/gh/mlazos/47/head 2025-12-04T09:43:32.5647427Z * [new branch] gh/mlazos/47/orig -> origin/gh/mlazos/47/orig 2025-12-04T09:43:32.5649717Z * [new branch] gh/mlazos/48/base -> origin/gh/mlazos/48/base 2025-12-04T09:43:32.5651825Z * [new branch] gh/mlazos/48/head -> 
origin/gh/mlazos/48/head 2025-12-04T09:43:32.5653234Z * [new branch] gh/mlazos/48/orig -> origin/gh/mlazos/48/orig 2025-12-04T09:43:32.5655706Z * [new branch] gh/mlazos/49/base -> origin/gh/mlazos/49/base 2025-12-04T09:43:32.5657595Z * [new branch] gh/mlazos/49/head -> origin/gh/mlazos/49/head 2025-12-04T09:43:32.5659126Z * [new branch] gh/mlazos/49/orig -> origin/gh/mlazos/49/orig 2025-12-04T09:43:32.5661443Z * [new branch] gh/mlazos/50/base -> origin/gh/mlazos/50/base 2025-12-04T09:43:32.5663119Z * [new branch] gh/mlazos/50/head -> origin/gh/mlazos/50/head 2025-12-04T09:43:32.5664861Z * [new branch] gh/mlazos/50/orig -> origin/gh/mlazos/50/orig 2025-12-04T09:43:32.5667034Z * [new branch] gh/mlazos/51/base -> origin/gh/mlazos/51/base 2025-12-04T09:43:32.5668891Z * [new branch] gh/mlazos/51/head -> origin/gh/mlazos/51/head 2025-12-04T09:43:32.5670599Z * [new branch] gh/mlazos/51/orig -> origin/gh/mlazos/51/orig 2025-12-04T09:43:32.5672927Z * [new branch] gh/mlazos/52/base -> origin/gh/mlazos/52/base 2025-12-04T09:43:32.5674654Z * [new branch] gh/mlazos/52/head -> origin/gh/mlazos/52/head 2025-12-04T09:43:32.5676321Z * [new branch] gh/mlazos/52/orig -> origin/gh/mlazos/52/orig 2025-12-04T09:43:32.5678662Z * [new branch] gh/mlazos/53/base -> origin/gh/mlazos/53/base 2025-12-04T09:43:32.5680373Z * [new branch] gh/mlazos/53/head -> origin/gh/mlazos/53/head 2025-12-04T09:43:32.5682110Z * [new branch] gh/mlazos/53/orig -> origin/gh/mlazos/53/orig 2025-12-04T09:43:32.5684390Z * [new branch] gh/mlazos/54/base -> origin/gh/mlazos/54/base 2025-12-04T09:43:32.5686287Z * [new branch] gh/mlazos/54/head -> origin/gh/mlazos/54/head 2025-12-04T09:43:32.5687990Z * [new branch] gh/mlazos/54/orig -> origin/gh/mlazos/54/orig 2025-12-04T09:43:32.5690189Z * [new branch] gh/mlazos/55/base -> origin/gh/mlazos/55/base 2025-12-04T09:43:32.5691887Z * [new branch] gh/mlazos/55/head -> origin/gh/mlazos/55/head 2025-12-04T09:43:32.5693561Z * [new branch] gh/mlazos/55/orig -> 
origin/gh/mlazos/55/orig 2025-12-04T09:43:32.5695939Z * [new branch] gh/mlazos/56/base -> origin/gh/mlazos/56/base 2025-12-04T09:43:32.5697722Z * [new branch] gh/mlazos/56/head -> origin/gh/mlazos/56/head 2025-12-04T09:43:32.5699407Z * [new branch] gh/mlazos/56/orig -> origin/gh/mlazos/56/orig 2025-12-04T09:43:32.5701755Z * [new branch] gh/mlazos/57/base -> origin/gh/mlazos/57/base 2025-12-04T09:43:32.5703423Z * [new branch] gh/mlazos/57/head -> origin/gh/mlazos/57/head 2025-12-04T09:43:32.5705142Z * [new branch] gh/mlazos/57/orig -> origin/gh/mlazos/57/orig 2025-12-04T09:43:32.5708050Z * [new branch] gh/mlazos/58/base -> origin/gh/mlazos/58/base 2025-12-04T09:43:32.5709728Z * [new branch] gh/mlazos/58/head -> origin/gh/mlazos/58/head 2025-12-04T09:43:32.5711448Z * [new branch] gh/mlazos/58/orig -> origin/gh/mlazos/58/orig 2025-12-04T09:43:32.5713721Z * [new branch] gh/mlazos/59/base -> origin/gh/mlazos/59/base 2025-12-04T09:43:32.5715422Z * [new branch] gh/mlazos/59/head -> origin/gh/mlazos/59/head 2025-12-04T09:43:32.5717040Z * [new branch] gh/mlazos/59/orig -> origin/gh/mlazos/59/orig 2025-12-04T09:43:32.5719425Z * [new branch] gh/mlazos/60/base -> origin/gh/mlazos/60/base 2025-12-04T09:43:32.5721330Z * [new branch] gh/mlazos/60/head -> origin/gh/mlazos/60/head 2025-12-04T09:43:32.5722915Z * [new branch] gh/mlazos/60/orig -> origin/gh/mlazos/60/orig 2025-12-04T09:43:32.5725658Z * [new branch] gh/mlazos/61/base -> origin/gh/mlazos/61/base 2025-12-04T09:43:32.5728989Z * [new branch] gh/mlazos/61/head -> origin/gh/mlazos/61/head 2025-12-04T09:43:32.5729731Z * [new branch] gh/mlazos/61/orig -> origin/gh/mlazos/61/orig 2025-12-04T09:43:32.5731594Z * [new branch] gh/mlazos/62/base -> origin/gh/mlazos/62/base 2025-12-04T09:43:32.5733228Z * [new branch] gh/mlazos/62/head -> origin/gh/mlazos/62/head 2025-12-04T09:43:32.5734902Z * [new branch] gh/mlazos/62/orig -> origin/gh/mlazos/62/orig 2025-12-04T09:43:32.5737257Z * [new branch] gh/mlazos/63/base -> 
origin/gh/mlazos/63/base 2025-12-04T09:43:32.5739039Z * [new branch] gh/mlazos/63/head -> origin/gh/mlazos/63/head 2025-12-04T09:43:32.5740680Z * [new branch] gh/mlazos/63/orig -> origin/gh/mlazos/63/orig 2025-12-04T09:43:32.5742980Z * [new branch] gh/mlazos/64/base -> origin/gh/mlazos/64/base 2025-12-04T09:43:32.5744767Z * [new branch] gh/mlazos/64/head -> origin/gh/mlazos/64/head 2025-12-04T09:43:32.5746496Z * [new branch] gh/mlazos/64/orig -> origin/gh/mlazos/64/orig 2025-12-04T09:43:32.5749064Z * [new branch] gh/mlazos/65/base -> origin/gh/mlazos/65/base 2025-12-04T09:43:32.5750847Z * [new branch] gh/mlazos/65/head -> origin/gh/mlazos/65/head 2025-12-04T09:43:32.5752561Z * [new branch] gh/mlazos/65/orig -> origin/gh/mlazos/65/orig 2025-12-04T09:43:32.5754921Z * [new branch] gh/mlazos/66/base -> origin/gh/mlazos/66/base 2025-12-04T09:43:32.5758379Z * [new branch] gh/mlazos/66/head -> origin/gh/mlazos/66/head 2025-12-04T09:43:32.5760086Z * [new branch] gh/mlazos/66/orig -> origin/gh/mlazos/66/orig 2025-12-04T09:43:32.5762521Z * [new branch] gh/mlazos/67/base -> origin/gh/mlazos/67/base 2025-12-04T09:43:32.5764260Z * [new branch] gh/mlazos/67/head -> origin/gh/mlazos/67/head 2025-12-04T09:43:32.5765916Z * [new branch] gh/mlazos/67/orig -> origin/gh/mlazos/67/orig 2025-12-04T09:43:32.5768288Z * [new branch] gh/mlazos/68/base -> origin/gh/mlazos/68/base 2025-12-04T09:43:32.5770092Z * [new branch] gh/mlazos/68/head -> origin/gh/mlazos/68/head 2025-12-04T09:43:32.5771861Z * [new branch] gh/mlazos/68/orig -> origin/gh/mlazos/68/orig 2025-12-04T09:43:32.5774236Z * [new branch] gh/mlazos/69/base -> origin/gh/mlazos/69/base 2025-12-04T09:43:32.5775943Z * [new branch] gh/mlazos/69/head -> origin/gh/mlazos/69/head 2025-12-04T09:43:32.5777674Z * [new branch] gh/mlazos/69/orig -> origin/gh/mlazos/69/orig 2025-12-04T09:43:32.5779977Z * [new branch] gh/mlazos/70/base -> origin/gh/mlazos/70/base 2025-12-04T09:43:32.5781663Z * [new branch] gh/mlazos/70/head -> 
origin/gh/mlazos/70/head 2025-12-04T09:43:32.5783430Z * [new branch] gh/mlazos/70/orig -> origin/gh/mlazos/70/orig 2025-12-04T09:43:32.5785764Z * [new branch] gh/mlazos/71/base -> origin/gh/mlazos/71/base 2025-12-04T09:43:32.5788093Z * [new branch] gh/mlazos/71/head -> origin/gh/mlazos/71/head 2025-12-04T09:43:32.5789784Z * [new branch] gh/mlazos/71/orig -> origin/gh/mlazos/71/orig 2025-12-04T09:43:32.5792099Z * [new branch] gh/mlazos/72/base -> origin/gh/mlazos/72/base 2025-12-04T09:43:32.5794062Z * [new branch] gh/mlazos/72/head -> origin/gh/mlazos/72/head 2025-12-04T09:43:32.5795685Z * [new branch] gh/mlazos/72/orig -> origin/gh/mlazos/72/orig 2025-12-04T09:43:32.5798090Z * [new branch] gh/mlazos/73/base -> origin/gh/mlazos/73/base 2025-12-04T09:43:32.5799802Z * [new branch] gh/mlazos/73/head -> origin/gh/mlazos/73/head 2025-12-04T09:43:32.5801559Z * [new branch] gh/mlazos/73/orig -> origin/gh/mlazos/73/orig 2025-12-04T09:43:32.5804507Z * [new branch] gh/mrmiywj/1/base -> origin/gh/mrmiywj/1/base 2025-12-04T09:43:32.5806219Z * [new branch] gh/mrmiywj/1/head -> origin/gh/mrmiywj/1/head 2025-12-04T09:43:32.5809021Z * [new branch] gh/muchulee8/73/base -> origin/gh/muchulee8/73/base 2025-12-04T09:43:32.5810953Z * [new branch] gh/muchulee8/73/head -> origin/gh/muchulee8/73/head 2025-12-04T09:43:32.5812826Z * [new branch] gh/muchulee8/73/orig -> origin/gh/muchulee8/73/orig 2025-12-04T09:43:32.5816596Z * [new branch] gh/naveenthangudu/1/base -> origin/gh/naveenthangudu/1/base 2025-12-04T09:43:32.5818410Z * [new branch] gh/naveenthangudu/1/head -> origin/gh/naveenthangudu/1/head 2025-12-04T09:43:32.5820379Z * [new branch] gh/naveenthangudu/1/orig -> origin/gh/naveenthangudu/1/orig 2025-12-04T09:43:32.5822883Z * [new branch] gh/naveenthangudu/2/base -> origin/gh/naveenthangudu/2/base 2025-12-04T09:43:32.5824620Z * [new branch] gh/naveenthangudu/2/head -> origin/gh/naveenthangudu/2/head 2025-12-04T09:43:32.5826338Z * [new branch] gh/naveenthangudu/2/orig -> 
origin/gh/naveenthangudu/2/orig 2025-12-04T09:43:32.5828680Z * [new branch] gh/naveenthangudu/3/base -> origin/gh/naveenthangudu/3/base 2025-12-04T09:43:32.5830400Z * [new branch] gh/naveenthangudu/3/head -> origin/gh/naveenthangudu/3/head 2025-12-04T09:43:32.5832166Z * [new branch] gh/naveenthangudu/3/orig -> origin/gh/naveenthangudu/3/orig 2025-12-04T09:43:32.5834456Z * [new branch] gh/naveenthangudu/4/base -> origin/gh/naveenthangudu/4/base 2025-12-04T09:43:32.5836143Z * [new branch] gh/naveenthangudu/4/head -> origin/gh/naveenthangudu/4/head 2025-12-04T09:43:32.5837936Z * [new branch] gh/naveenthangudu/4/orig -> origin/gh/naveenthangudu/4/orig 2025-12-04T09:43:32.5840300Z * [new branch] gh/naveenthangudu/5/base -> origin/gh/naveenthangudu/5/base 2025-12-04T09:43:32.5842104Z * [new branch] gh/naveenthangudu/5/head -> origin/gh/naveenthangudu/5/head 2025-12-04T09:43:32.5843961Z * [new branch] gh/naveenthangudu/5/orig -> origin/gh/naveenthangudu/5/orig 2025-12-04T09:43:32.5846333Z * [new branch] gh/naveenthangudu/6/base -> origin/gh/naveenthangudu/6/base 2025-12-04T09:43:32.5848078Z * [new branch] gh/naveenthangudu/6/head -> origin/gh/naveenthangudu/6/head 2025-12-04T09:43:32.5849730Z * [new branch] gh/naveenthangudu/6/orig -> origin/gh/naveenthangudu/6/orig 2025-12-04T09:43:32.5852085Z * [new branch] gh/naveenthangudu/7/base -> origin/gh/naveenthangudu/7/base 2025-12-04T09:43:32.5853843Z * [new branch] gh/naveenthangudu/7/head -> origin/gh/naveenthangudu/7/head 2025-12-04T09:43:32.5855719Z * [new branch] gh/naveenthangudu/7/orig -> origin/gh/naveenthangudu/7/orig 2025-12-04T09:43:32.5857975Z * [new branch] gh/naveenthangudu/8/base -> origin/gh/naveenthangudu/8/base 2025-12-04T09:43:32.5859733Z * [new branch] gh/naveenthangudu/8/head -> origin/gh/naveenthangudu/8/head 2025-12-04T09:43:32.5861449Z * [new branch] gh/naveenthangudu/8/orig -> origin/gh/naveenthangudu/8/orig 2025-12-04T09:43:32.5864012Z * [new branch] gh/naveenthangudu/9/base -> 
origin/gh/naveenthangudu/9/base 2025-12-04T09:43:32.5865659Z * [new branch] gh/naveenthangudu/9/head -> origin/gh/naveenthangudu/9/head 2025-12-04T09:43:32.5867473Z * [new branch] gh/naveenthangudu/9/orig -> origin/gh/naveenthangudu/9/orig 2025-12-04T09:43:32.5870347Z * [new branch] gh/nikitaved/1/base -> origin/gh/nikitaved/1/base 2025-12-04T09:43:32.5872044Z * [new branch] gh/nikitaved/1/head -> origin/gh/nikitaved/1/head 2025-12-04T09:43:32.5873769Z * [new branch] gh/nikitaved/1/orig -> origin/gh/nikitaved/1/orig 2025-12-04T09:43:32.5876076Z * [new branch] gh/nikitaved/10/base -> origin/gh/nikitaved/10/base 2025-12-04T09:43:32.5877769Z * [new branch] gh/nikitaved/10/head -> origin/gh/nikitaved/10/head 2025-12-04T09:43:32.5879507Z * [new branch] gh/nikitaved/10/orig -> origin/gh/nikitaved/10/orig 2025-12-04T09:43:32.5881742Z * [new branch] gh/nikitaved/11/base -> origin/gh/nikitaved/11/base 2025-12-04T09:43:32.5883539Z * [new branch] gh/nikitaved/11/head -> origin/gh/nikitaved/11/head 2025-12-04T09:43:32.5885835Z * [new branch] gh/nikitaved/11/orig -> origin/gh/nikitaved/11/orig 2025-12-04T09:43:32.5888070Z * [new branch] gh/nikitaved/12/base -> origin/gh/nikitaved/12/base 2025-12-04T09:43:32.5889764Z * [new branch] gh/nikitaved/12/head -> origin/gh/nikitaved/12/head 2025-12-04T09:43:32.5891592Z * [new branch] gh/nikitaved/12/orig -> origin/gh/nikitaved/12/orig 2025-12-04T09:43:32.5893882Z * [new branch] gh/nikitaved/13/base -> origin/gh/nikitaved/13/base 2025-12-04T09:43:32.5895617Z * [new branch] gh/nikitaved/13/head -> origin/gh/nikitaved/13/head 2025-12-04T09:43:32.5897361Z * [new branch] gh/nikitaved/13/orig -> origin/gh/nikitaved/13/orig 2025-12-04T09:43:32.5899749Z * [new branch] gh/nikitaved/14/base -> origin/gh/nikitaved/14/base 2025-12-04T09:43:32.5901451Z * [new branch] gh/nikitaved/14/head -> origin/gh/nikitaved/14/head 2025-12-04T09:43:32.5903162Z * [new branch] gh/nikitaved/14/orig -> origin/gh/nikitaved/14/orig 2025-12-04T09:43:32.5905391Z * [new 
branch] gh/nikitaved/15/base -> origin/gh/nikitaved/15/base 2025-12-04T09:43:32.5907123Z * [new branch] gh/nikitaved/15/head -> origin/gh/nikitaved/15/head 2025-12-04T09:43:32.5909004Z * [new branch] gh/nikitaved/15/orig -> origin/gh/nikitaved/15/orig 2025-12-04T09:43:32.5911264Z * [new branch] gh/nikitaved/16/base -> origin/gh/nikitaved/16/base 2025-12-04T09:43:32.5913022Z * [new branch] gh/nikitaved/16/head -> origin/gh/nikitaved/16/head 2025-12-04T09:43:32.5914767Z * [new branch] gh/nikitaved/16/orig -> origin/gh/nikitaved/16/orig 2025-12-04T09:43:32.5917090Z * [new branch] gh/nikitaved/2/base -> origin/gh/nikitaved/2/base 2025-12-04T09:43:32.5918810Z * [new branch] gh/nikitaved/2/head -> origin/gh/nikitaved/2/head 2025-12-04T09:43:32.5920575Z * [new branch] gh/nikitaved/2/orig -> origin/gh/nikitaved/2/orig 2025-12-04T09:43:32.5922848Z * [new branch] gh/nikitaved/4/base -> origin/gh/nikitaved/4/base 2025-12-04T09:43:32.5924535Z * [new branch] gh/nikitaved/4/head -> origin/gh/nikitaved/4/head 2025-12-04T09:43:32.5926253Z * [new branch] gh/nikitaved/4/orig -> origin/gh/nikitaved/4/orig 2025-12-04T09:43:32.5928586Z * [new branch] gh/nikitaved/5/base -> origin/gh/nikitaved/5/base 2025-12-04T09:43:32.5930329Z * [new branch] gh/nikitaved/5/head -> origin/gh/nikitaved/5/head 2025-12-04T09:43:32.5932646Z * [new branch] gh/nikitaved/5/orig -> origin/gh/nikitaved/5/orig 2025-12-04T09:43:32.5934799Z * [new branch] gh/nikitaved/6/base -> origin/gh/nikitaved/6/base 2025-12-04T09:43:32.5936517Z * [new branch] gh/nikitaved/6/head -> origin/gh/nikitaved/6/head 2025-12-04T09:43:32.5938226Z * [new branch] gh/nikitaved/6/orig -> origin/gh/nikitaved/6/orig 2025-12-04T09:43:32.5940520Z * [new branch] gh/nikitaved/8/base -> origin/gh/nikitaved/8/base 2025-12-04T09:43:32.5942280Z * [new branch] gh/nikitaved/8/head -> origin/gh/nikitaved/8/head 2025-12-04T09:43:32.5944000Z * [new branch] gh/nikitaved/8/orig -> origin/gh/nikitaved/8/orig 2025-12-04T09:43:32.5946334Z * [new branch] 
gh/nikitaved/9/base -> origin/gh/nikitaved/9/base 2025-12-04T09:43:32.5948113Z * [new branch] gh/nikitaved/9/head -> origin/gh/nikitaved/9/head 2025-12-04T09:43:32.5949802Z * [new branch] gh/nikitaved/9/orig -> origin/gh/nikitaved/9/orig 2025-12-04T09:43:32.5952536Z * [new branch] gh/oulgen/10/base -> origin/gh/oulgen/10/base 2025-12-04T09:43:32.5954302Z * [new branch] gh/oulgen/10/head -> origin/gh/oulgen/10/head 2025-12-04T09:43:32.5956282Z * [new branch] gh/oulgen/10/orig -> origin/gh/oulgen/10/orig 2025-12-04T09:43:32.5958596Z * [new branch] gh/oulgen/11/base -> origin/gh/oulgen/11/base 2025-12-04T09:43:32.5960395Z * [new branch] gh/oulgen/11/head -> origin/gh/oulgen/11/head 2025-12-04T09:43:32.5962125Z * [new branch] gh/oulgen/11/orig -> origin/gh/oulgen/11/orig 2025-12-04T09:43:32.5964390Z * [new branch] gh/oulgen/12/base -> origin/gh/oulgen/12/base 2025-12-04T09:43:32.5966123Z * [new branch] gh/oulgen/12/head -> origin/gh/oulgen/12/head 2025-12-04T09:43:32.5967812Z * [new branch] gh/oulgen/12/orig -> origin/gh/oulgen/12/orig 2025-12-04T09:43:32.5970094Z * [new branch] gh/oulgen/13/base -> origin/gh/oulgen/13/base 2025-12-04T09:43:32.5971699Z * [new branch] gh/oulgen/13/head -> origin/gh/oulgen/13/head 2025-12-04T09:43:32.5973375Z * [new branch] gh/oulgen/13/orig -> origin/gh/oulgen/13/orig 2025-12-04T09:43:32.5975663Z * [new branch] gh/oulgen/14/base -> origin/gh/oulgen/14/base 2025-12-04T09:43:32.5977611Z * [new branch] gh/oulgen/14/head -> origin/gh/oulgen/14/head 2025-12-04T09:43:32.5979207Z * [new branch] gh/oulgen/14/orig -> origin/gh/oulgen/14/orig 2025-12-04T09:43:32.5981517Z * [new branch] gh/oulgen/15/base -> origin/gh/oulgen/15/base 2025-12-04T09:43:32.5983288Z * [new branch] gh/oulgen/15/head -> origin/gh/oulgen/15/head 2025-12-04T09:43:32.5984988Z * [new branch] gh/oulgen/15/orig -> origin/gh/oulgen/15/orig 2025-12-04T09:43:32.5987288Z * [new branch] gh/oulgen/16/base -> origin/gh/oulgen/16/base 2025-12-04T09:43:32.5989115Z * [new branch] 
gh/oulgen/16/head -> origin/gh/oulgen/16/head 2025-12-04T09:43:32.5990746Z * [new branch] gh/oulgen/16/orig -> origin/gh/oulgen/16/orig 2025-12-04T09:43:32.5993029Z * [new branch] gh/oulgen/17/base -> origin/gh/oulgen/17/base 2025-12-04T09:43:32.5994689Z * [new branch] gh/oulgen/17/head -> origin/gh/oulgen/17/head 2025-12-04T09:43:32.5996698Z * [new branch] gh/oulgen/17/orig -> origin/gh/oulgen/17/orig 2025-12-04T09:43:32.5998718Z * [new branch] gh/oulgen/18/base -> origin/gh/oulgen/18/base 2025-12-04T09:43:32.6000485Z * [new branch] gh/oulgen/18/head -> origin/gh/oulgen/18/head 2025-12-04T09:43:32.6002370Z * [new branch] gh/oulgen/18/orig -> origin/gh/oulgen/18/orig 2025-12-04T09:43:32.6004479Z * [new branch] gh/oulgen/19/base -> origin/gh/oulgen/19/base 2025-12-04T09:43:32.6006216Z * [new branch] gh/oulgen/19/head -> origin/gh/oulgen/19/head 2025-12-04T09:43:32.6008353Z * [new branch] gh/oulgen/19/orig -> origin/gh/oulgen/19/orig 2025-12-04T09:43:32.6010700Z * [new branch] gh/oulgen/20/base -> origin/gh/oulgen/20/base 2025-12-04T09:43:32.6012476Z * [new branch] gh/oulgen/20/head -> origin/gh/oulgen/20/head 2025-12-04T09:43:32.6014124Z * [new branch] gh/oulgen/20/orig -> origin/gh/oulgen/20/orig 2025-12-04T09:43:32.6016341Z * [new branch] gh/oulgen/21/base -> origin/gh/oulgen/21/base 2025-12-04T09:43:32.6018048Z * [new branch] gh/oulgen/21/head -> origin/gh/oulgen/21/head 2025-12-04T09:43:32.6019848Z * [new branch] gh/oulgen/21/orig -> origin/gh/oulgen/21/orig 2025-12-04T09:43:32.6022009Z * [new branch] gh/oulgen/22/base -> origin/gh/oulgen/22/base 2025-12-04T09:43:32.6023779Z * [new branch] gh/oulgen/22/head -> origin/gh/oulgen/22/head 2025-12-04T09:43:32.6025464Z * [new branch] gh/oulgen/22/orig -> origin/gh/oulgen/22/orig 2025-12-04T09:43:32.6027826Z * [new branch] gh/oulgen/23/base -> origin/gh/oulgen/23/base 2025-12-04T09:43:32.6029584Z * [new branch] gh/oulgen/23/head -> origin/gh/oulgen/23/head 2025-12-04T09:43:32.6031305Z * [new branch] gh/oulgen/23/orig 
-> origin/gh/oulgen/23/orig 2025-12-04T09:43:32.6033518Z * [new branch] gh/oulgen/24/base -> origin/gh/oulgen/24/base 2025-12-04T09:43:32.6035218Z * [new branch] gh/oulgen/24/head -> origin/gh/oulgen/24/head 2025-12-04T09:43:32.6036884Z * [new branch] gh/oulgen/24/orig -> origin/gh/oulgen/24/orig 2025-12-04T09:43:32.6039264Z * [new branch] gh/oulgen/25/base -> origin/gh/oulgen/25/base 2025-12-04T09:43:32.6040891Z * [new branch] gh/oulgen/25/head -> origin/gh/oulgen/25/head 2025-12-04T09:43:32.6042658Z * [new branch] gh/oulgen/25/orig -> origin/gh/oulgen/25/orig 2025-12-04T09:43:32.6044913Z * [new branch] gh/oulgen/26/base -> origin/gh/oulgen/26/base 2025-12-04T09:43:32.6046708Z * [new branch] gh/oulgen/26/head -> origin/gh/oulgen/26/head 2025-12-04T09:43:32.6048453Z * [new branch] gh/oulgen/26/orig -> origin/gh/oulgen/26/orig 2025-12-04T09:43:32.6050825Z * [new branch] gh/oulgen/4/base -> origin/gh/oulgen/4/base 2025-12-04T09:43:32.6052554Z * [new branch] gh/oulgen/4/head -> origin/gh/oulgen/4/head 2025-12-04T09:43:32.6054293Z * [new branch] gh/oulgen/4/orig -> origin/gh/oulgen/4/orig 2025-12-04T09:43:32.6057403Z * [new branch] gh/oulgen/7/base -> origin/gh/oulgen/7/base 2025-12-04T09:43:32.6059067Z * [new branch] gh/oulgen/7/head -> origin/gh/oulgen/7/head 2025-12-04T09:43:32.6060746Z * [new branch] gh/oulgen/7/orig -> origin/gh/oulgen/7/orig 2025-12-04T09:43:32.6063162Z * [new branch] gh/oulgen/8/base -> origin/gh/oulgen/8/base 2025-12-04T09:43:32.6064903Z * [new branch] gh/oulgen/8/head -> origin/gh/oulgen/8/head 2025-12-04T09:43:32.6066592Z * [new branch] gh/oulgen/8/orig -> origin/gh/oulgen/8/orig 2025-12-04T09:43:32.6069005Z * [new branch] gh/oulgen/9/base -> origin/gh/oulgen/9/base 2025-12-04T09:43:32.6070718Z * [new branch] gh/oulgen/9/head -> origin/gh/oulgen/9/head 2025-12-04T09:43:32.6072562Z * [new branch] gh/oulgen/9/orig -> origin/gh/oulgen/9/orig 2025-12-04T09:43:32.6074987Z * [new branch] gh/patvig/mtia-serialization -> 
origin/gh/patvig/mtia-serialization 2025-12-04T09:43:32.6078047Z * [new branch] gh/pearu/108/base -> origin/gh/pearu/108/base 2025-12-04T09:43:32.6079534Z * [new branch] gh/pearu/108/head -> origin/gh/pearu/108/head 2025-12-04T09:43:32.6081366Z * [new branch] gh/pearu/108/orig -> origin/gh/pearu/108/orig 2025-12-04T09:43:32.6084074Z * [new branch] gh/pearu/109/base -> origin/gh/pearu/109/base 2025-12-04T09:43:32.6085750Z * [new branch] gh/pearu/109/head -> origin/gh/pearu/109/head 2025-12-04T09:43:32.6087466Z * [new branch] gh/pearu/109/orig -> origin/gh/pearu/109/orig 2025-12-04T09:43:32.6089780Z * [new branch] gh/pearu/110/base -> origin/gh/pearu/110/base 2025-12-04T09:43:32.6091520Z * [new branch] gh/pearu/110/head -> origin/gh/pearu/110/head 2025-12-04T09:43:32.6093494Z * [new branch] gh/pearu/110/orig -> origin/gh/pearu/110/orig 2025-12-04T09:43:32.6095787Z * [new branch] gh/pearu/111/base -> origin/gh/pearu/111/base 2025-12-04T09:43:32.6097612Z * [new branch] gh/pearu/111/head -> origin/gh/pearu/111/head 2025-12-04T09:43:32.6099786Z * [new branch] gh/pearu/111/orig -> origin/gh/pearu/111/orig 2025-12-04T09:43:32.6102501Z * [new branch] gh/pearu/112/base -> origin/gh/pearu/112/base 2025-12-04T09:43:32.6104245Z * [new branch] gh/pearu/112/head -> origin/gh/pearu/112/head 2025-12-04T09:43:32.6106035Z * [new branch] gh/pearu/112/orig -> origin/gh/pearu/112/orig 2025-12-04T09:43:32.6108378Z * [new branch] gh/pearu/115/base -> origin/gh/pearu/115/base 2025-12-04T09:43:32.6110125Z * [new branch] gh/pearu/115/head -> origin/gh/pearu/115/head 2025-12-04T09:43:32.6111802Z * [new branch] gh/pearu/115/orig -> origin/gh/pearu/115/orig 2025-12-04T09:43:32.6114049Z * [new branch] gh/pearu/116/base -> origin/gh/pearu/116/base 2025-12-04T09:43:32.6115704Z * [new branch] gh/pearu/116/head -> origin/gh/pearu/116/head 2025-12-04T09:43:32.6117469Z * [new branch] gh/pearu/116/orig -> origin/gh/pearu/116/orig 2025-12-04T09:43:32.6119773Z * [new branch] gh/pearu/117/base -> 
origin/gh/pearu/117/base 2025-12-04T09:43:32.6121971Z * [new branch] gh/pearu/117/head -> origin/gh/pearu/117/head 2025-12-04T09:43:32.6123712Z * [new branch] gh/pearu/117/orig -> origin/gh/pearu/117/orig 2025-12-04T09:43:32.6125925Z * [new branch] gh/pearu/118/base -> origin/gh/pearu/118/base 2025-12-04T09:43:32.6127690Z * [new branch] gh/pearu/118/head -> origin/gh/pearu/118/head 2025-12-04T09:43:32.6129429Z * [new branch] gh/pearu/118/orig -> origin/gh/pearu/118/orig 2025-12-04T09:43:32.6131661Z * [new branch] gh/pearu/119/base -> origin/gh/pearu/119/base 2025-12-04T09:43:32.6133378Z * [new branch] gh/pearu/119/head -> origin/gh/pearu/119/head 2025-12-04T09:43:32.6135057Z * [new branch] gh/pearu/119/orig -> origin/gh/pearu/119/orig 2025-12-04T09:43:32.6137397Z * [new branch] gh/pearu/139/base -> origin/gh/pearu/139/base 2025-12-04T09:43:32.6139066Z * [new branch] gh/pearu/139/head -> origin/gh/pearu/139/head 2025-12-04T09:43:32.6140848Z * [new branch] gh/pearu/139/orig -> origin/gh/pearu/139/orig 2025-12-04T09:43:32.6143224Z * [new branch] gh/pearu/140/base -> origin/gh/pearu/140/base 2025-12-04T09:43:32.6145116Z * [new branch] gh/pearu/140/head -> origin/gh/pearu/140/head 2025-12-04T09:43:32.6146715Z * [new branch] gh/pearu/140/orig -> origin/gh/pearu/140/orig 2025-12-04T09:43:32.6149081Z * [new branch] gh/pearu/142/base -> origin/gh/pearu/142/base 2025-12-04T09:43:32.6150756Z * [new branch] gh/pearu/142/head -> origin/gh/pearu/142/head 2025-12-04T09:43:32.6152422Z * [new branch] gh/pearu/142/orig -> origin/gh/pearu/142/orig 2025-12-04T09:43:32.6154820Z * [new branch] gh/pearu/143/base -> origin/gh/pearu/143/base 2025-12-04T09:43:32.6158719Z * [new branch] gh/pearu/143/head -> origin/gh/pearu/143/head 2025-12-04T09:43:32.6160772Z * [new branch] gh/pearu/143/orig -> origin/gh/pearu/143/orig 2025-12-04T09:43:32.6163219Z * [new branch] gh/pearu/147/base -> origin/gh/pearu/147/base 2025-12-04T09:43:32.6164938Z * [new branch] gh/pearu/147/head -> 
origin/gh/pearu/147/head 2025-12-04T09:43:32.6166755Z * [new branch] gh/pearu/147/orig -> origin/gh/pearu/147/orig 2025-12-04T09:43:32.6169057Z * [new branch] gh/pearu/149/base -> origin/gh/pearu/149/base 2025-12-04T09:43:32.6171085Z * [new branch] gh/pearu/149/head -> origin/gh/pearu/149/head 2025-12-04T09:43:32.6172864Z * [new branch] gh/pearu/149/orig -> origin/gh/pearu/149/orig 2025-12-04T09:43:32.6175617Z * [new branch] gh/pearu/150/base -> origin/gh/pearu/150/base 2025-12-04T09:43:32.6177349Z * [new branch] gh/pearu/150/head -> origin/gh/pearu/150/head 2025-12-04T09:43:32.6179012Z * [new branch] gh/pearu/150/orig -> origin/gh/pearu/150/orig 2025-12-04T09:43:32.6181535Z * [new branch] gh/pearu/151/base -> origin/gh/pearu/151/base 2025-12-04T09:43:32.6183214Z * [new branch] gh/pearu/151/head -> origin/gh/pearu/151/head 2025-12-04T09:43:32.6184887Z * [new branch] gh/pearu/151/orig -> origin/gh/pearu/151/orig 2025-12-04T09:43:32.6187381Z * [new branch] gh/pearu/152/base -> origin/gh/pearu/152/base 2025-12-04T09:43:32.6189138Z * [new branch] gh/pearu/152/head -> origin/gh/pearu/152/head 2025-12-04T09:43:32.6190882Z * [new branch] gh/pearu/152/orig -> origin/gh/pearu/152/orig 2025-12-04T09:43:32.6193687Z * [new branch] gh/pearu/153/base -> origin/gh/pearu/153/base 2025-12-04T09:43:32.6195374Z * [new branch] gh/pearu/153/head -> origin/gh/pearu/153/head 2025-12-04T09:43:32.6197496Z * [new branch] gh/pearu/153/orig -> origin/gh/pearu/153/orig 2025-12-04T09:43:32.6199906Z * [new branch] gh/pearu/154/base -> origin/gh/pearu/154/base 2025-12-04T09:43:32.6201666Z * [new branch] gh/pearu/154/head -> origin/gh/pearu/154/head 2025-12-04T09:43:32.6203344Z * [new branch] gh/pearu/154/orig -> origin/gh/pearu/154/orig 2025-12-04T09:43:32.6205753Z * [new branch] gh/pearu/155/base -> origin/gh/pearu/155/base 2025-12-04T09:43:32.6207464Z * [new branch] gh/pearu/155/head -> origin/gh/pearu/155/head 2025-12-04T09:43:32.6209197Z * [new branch] gh/pearu/155/orig -> 
origin/gh/pearu/155/orig 2025-12-04T09:43:32.6211521Z * [new branch] gh/pearu/156/base -> origin/gh/pearu/156/base 2025-12-04T09:43:32.6213250Z * [new branch] gh/pearu/156/head -> origin/gh/pearu/156/head 2025-12-04T09:43:32.6215088Z * [new branch] gh/pearu/156/orig -> origin/gh/pearu/156/orig 2025-12-04T09:43:32.6217837Z * [new branch] gh/pearu/56/base -> origin/gh/pearu/56/base 2025-12-04T09:43:32.6219854Z * [new branch] gh/pearu/56/head -> origin/gh/pearu/56/head 2025-12-04T09:43:32.6221456Z * [new branch] gh/pearu/56/orig -> origin/gh/pearu/56/orig 2025-12-04T09:43:32.6223935Z * [new branch] gh/pearu/97/base -> origin/gh/pearu/97/base 2025-12-04T09:43:32.6225748Z * [new branch] gh/pearu/97/head -> origin/gh/pearu/97/head 2025-12-04T09:43:32.6227522Z * [new branch] gh/pearu/97/orig -> origin/gh/pearu/97/orig 2025-12-04T09:43:32.6230406Z * [new branch] gh/pianpwk/21/base -> origin/gh/pianpwk/21/base 2025-12-04T09:43:32.6232095Z * [new branch] gh/pianpwk/21/head -> origin/gh/pianpwk/21/head 2025-12-04T09:43:32.6234452Z * [new branch] gh/pianpwk/28/base -> origin/gh/pianpwk/28/base 2025-12-04T09:43:32.6236084Z * [new branch] gh/pianpwk/28/head -> origin/gh/pianpwk/28/head 2025-12-04T09:43:32.6237962Z * [new branch] gh/pianpwk/28/orig -> origin/gh/pianpwk/28/orig 2025-12-04T09:43:32.6240255Z * [new branch] gh/pianpwk/29/base -> origin/gh/pianpwk/29/base 2025-12-04T09:43:32.6242068Z * [new branch] gh/pianpwk/29/head -> origin/gh/pianpwk/29/head 2025-12-04T09:43:32.6243773Z * [new branch] gh/pianpwk/29/orig -> origin/gh/pianpwk/29/orig 2025-12-04T09:43:32.6246163Z * [new branch] gh/pianpwk/30/base -> origin/gh/pianpwk/30/base 2025-12-04T09:43:32.6247814Z * [new branch] gh/pianpwk/30/head -> origin/gh/pianpwk/30/head 2025-12-04T09:43:32.6249605Z * [new branch] gh/pianpwk/30/orig -> origin/gh/pianpwk/30/orig 2025-12-04T09:43:32.6251942Z * [new branch] gh/pianpwk/31/base -> origin/gh/pianpwk/31/base 2025-12-04T09:43:32.6253651Z * [new branch] gh/pianpwk/31/head -> 
origin/gh/pianpwk/31/head 2025-12-04T09:43:32.6255500Z * [new branch] gh/pianpwk/31/orig -> origin/gh/pianpwk/31/orig 2025-12-04T09:43:32.6257761Z * [new branch] gh/pianpwk/32/base -> origin/gh/pianpwk/32/base 2025-12-04T09:43:32.6259423Z * [new branch] gh/pianpwk/32/head -> origin/gh/pianpwk/32/head 2025-12-04T09:43:32.6261210Z * [new branch] gh/pianpwk/32/orig -> origin/gh/pianpwk/32/orig 2025-12-04T09:43:32.6263367Z * [new branch] gh/pianpwk/33/base -> origin/gh/pianpwk/33/base 2025-12-04T09:43:32.6265096Z * [new branch] gh/pianpwk/33/head -> origin/gh/pianpwk/33/head 2025-12-04T09:43:32.6266887Z * [new branch] gh/pianpwk/33/orig -> origin/gh/pianpwk/33/orig 2025-12-04T09:43:32.6269977Z * [new branch] gh/pianpwk/34/base -> origin/gh/pianpwk/34/base 2025-12-04T09:43:32.6271948Z * [new branch] gh/pianpwk/34/head -> origin/gh/pianpwk/34/head 2025-12-04T09:43:32.6273822Z * [new branch] gh/pianpwk/34/orig -> origin/gh/pianpwk/34/orig 2025-12-04T09:43:32.6276125Z * [new branch] gh/pianpwk/35/base -> origin/gh/pianpwk/35/base 2025-12-04T09:43:32.6278000Z * [new branch] gh/pianpwk/35/head -> origin/gh/pianpwk/35/head 2025-12-04T09:43:32.6279770Z * [new branch] gh/pianpwk/35/orig -> origin/gh/pianpwk/35/orig 2025-12-04T09:43:32.6282514Z * [new branch] gh/rec/141/base -> origin/gh/rec/141/base 2025-12-04T09:43:32.6284244Z * [new branch] gh/rec/141/head -> origin/gh/rec/141/head 2025-12-04T09:43:32.6286508Z * [new branch] gh/rec/153/base -> origin/gh/rec/153/base 2025-12-04T09:43:32.6288208Z * [new branch] gh/rec/153/head -> origin/gh/rec/153/head 2025-12-04T09:43:32.6289930Z * [new branch] gh/rec/153/orig -> origin/gh/rec/153/orig 2025-12-04T09:43:32.6292326Z * [new branch] gh/rec/154/base -> origin/gh/rec/154/base 2025-12-04T09:43:32.6293942Z * [new branch] gh/rec/154/head -> origin/gh/rec/154/head 2025-12-04T09:43:32.6295654Z * [new branch] gh/rec/154/orig -> origin/gh/rec/154/orig 2025-12-04T09:43:32.6297886Z * [new branch] gh/rec/164/base -> origin/gh/rec/164/base 
2025-12-04T09:43:32.6299613Z * [new branch] gh/rec/164/head -> origin/gh/rec/164/head 2025-12-04T09:43:32.6301536Z * [new branch] gh/rec/164/orig -> origin/gh/rec/164/orig 2025-12-04T09:43:32.6303796Z * [new branch] gh/rec/166/base -> origin/gh/rec/166/base 2025-12-04T09:43:32.6305525Z * [new branch] gh/rec/166/head -> origin/gh/rec/166/head 2025-12-04T09:43:32.6307341Z * [new branch] gh/rec/166/orig -> origin/gh/rec/166/orig 2025-12-04T09:43:32.6309677Z * [new branch] gh/rec/167/base -> origin/gh/rec/167/base 2025-12-04T09:43:32.6311358Z * [new branch] gh/rec/167/head -> origin/gh/rec/167/head 2025-12-04T09:43:32.6313133Z * [new branch] gh/rec/167/orig -> origin/gh/rec/167/orig 2025-12-04T09:43:32.6315466Z * [new branch] gh/rec/168/base -> origin/gh/rec/168/base 2025-12-04T09:43:32.6317187Z * [new branch] gh/rec/168/head -> origin/gh/rec/168/head 2025-12-04T09:43:32.6318918Z * [new branch] gh/rec/168/orig -> origin/gh/rec/168/orig 2025-12-04T09:43:32.6321148Z * [new branch] gh/rec/169/base -> origin/gh/rec/169/base 2025-12-04T09:43:32.6322902Z * [new branch] gh/rec/169/head -> origin/gh/rec/169/head 2025-12-04T09:43:32.6324585Z * [new branch] gh/rec/169/orig -> origin/gh/rec/169/orig 2025-12-04T09:43:32.6326910Z * [new branch] gh/rec/170/base -> origin/gh/rec/170/base 2025-12-04T09:43:32.6328564Z * [new branch] gh/rec/170/head -> origin/gh/rec/170/head 2025-12-04T09:43:32.6330356Z * [new branch] gh/rec/170/orig -> origin/gh/rec/170/orig 2025-12-04T09:43:32.6332664Z * [new branch] gh/rec/171/base -> origin/gh/rec/171/base 2025-12-04T09:43:32.6334353Z * [new branch] gh/rec/171/head -> origin/gh/rec/171/head 2025-12-04T09:43:32.6336086Z * [new branch] gh/rec/171/orig -> origin/gh/rec/171/orig 2025-12-04T09:43:32.6338307Z * [new branch] gh/rec/172/base -> origin/gh/rec/172/base 2025-12-04T09:43:32.6340064Z * [new branch] gh/rec/172/head -> origin/gh/rec/172/head 2025-12-04T09:43:32.6341724Z * [new branch] gh/rec/172/orig -> origin/gh/rec/172/orig 
2025-12-04T09:43:32.6344018Z * [new branch] gh/rec/173/base -> origin/gh/rec/173/base 2025-12-04T09:43:32.6346170Z * [new branch] gh/rec/173/head -> origin/gh/rec/173/head 2025-12-04T09:43:32.6347980Z * [new branch] gh/rec/173/orig -> origin/gh/rec/173/orig 2025-12-04T09:43:32.6350344Z * [new branch] gh/rec/174/base -> origin/gh/rec/174/base 2025-12-04T09:43:32.6351907Z * [new branch] gh/rec/174/head -> origin/gh/rec/174/head 2025-12-04T09:43:32.6353707Z * [new branch] gh/rec/174/orig -> origin/gh/rec/174/orig 2025-12-04T09:43:32.6356243Z * [new branch] gh/rec/175/base -> origin/gh/rec/175/base 2025-12-04T09:43:32.6358318Z * [new branch] gh/rec/175/head -> origin/gh/rec/175/head 2025-12-04T09:43:32.6360134Z * [new branch] gh/rec/175/orig -> origin/gh/rec/175/orig 2025-12-04T09:43:32.6362580Z * [new branch] gh/rec/176/base -> origin/gh/rec/176/base 2025-12-04T09:43:32.6364170Z * [new branch] gh/rec/176/head -> origin/gh/rec/176/head 2025-12-04T09:43:32.6365820Z * [new branch] gh/rec/176/orig -> origin/gh/rec/176/orig 2025-12-04T09:43:32.6368076Z * [new branch] gh/rec/177/base -> origin/gh/rec/177/base 2025-12-04T09:43:32.6369784Z * [new branch] gh/rec/177/head -> origin/gh/rec/177/head 2025-12-04T09:43:32.6371516Z * [new branch] gh/rec/177/orig -> origin/gh/rec/177/orig 2025-12-04T09:43:32.6374898Z * [new branch] gh/robert-hardwick/3/base -> origin/gh/robert-hardwick/3/base 2025-12-04T09:43:32.6376751Z * [new branch] gh/robert-hardwick/3/head -> origin/gh/robert-hardwick/3/head 2025-12-04T09:43:32.6378467Z * [new branch] gh/robert-hardwick/3/orig -> origin/gh/robert-hardwick/3/orig 2025-12-04T09:43:32.6380889Z * [new branch] gh/robert-hardwick/4/base -> origin/gh/robert-hardwick/4/base 2025-12-04T09:43:32.6382619Z * [new branch] gh/robert-hardwick/4/head -> origin/gh/robert-hardwick/4/head 2025-12-04T09:43:32.6384360Z * [new branch] gh/robert-hardwick/4/orig -> origin/gh/robert-hardwick/4/orig 2025-12-04T09:43:32.6386654Z * [new branch] gh/robert-hardwick/5/base -> 
origin/gh/robert-hardwick/5/base 2025-12-04T09:43:32.6388477Z * [new branch] gh/robert-hardwick/5/head -> origin/gh/robert-hardwick/5/head 2025-12-04T09:43:32.6390275Z * [new branch] gh/robert-hardwick/5/orig -> origin/gh/robert-hardwick/5/orig 2025-12-04T09:43:32.6392527Z * [new branch] gh/robert-hardwick/6/base -> origin/gh/robert-hardwick/6/base 2025-12-04T09:43:32.6394246Z * [new branch] gh/robert-hardwick/6/head -> origin/gh/robert-hardwick/6/head 2025-12-04T09:43:32.6395960Z * [new branch] gh/robert-hardwick/6/orig -> origin/gh/robert-hardwick/6/orig 2025-12-04T09:43:32.6398284Z * [new branch] gh/robert-hardwick/7/base -> origin/gh/robert-hardwick/7/base 2025-12-04T09:43:32.6399986Z * [new branch] gh/robert-hardwick/7/head -> origin/gh/robert-hardwick/7/head 2025-12-04T09:43:32.6401653Z * [new branch] gh/robert-hardwick/7/orig -> origin/gh/robert-hardwick/7/orig 2025-12-04T09:43:32.6403924Z * [new branch] gh/robert-hardwick/8/base -> origin/gh/robert-hardwick/8/base 2025-12-04T09:43:32.6405653Z * [new branch] gh/robert-hardwick/8/head -> origin/gh/robert-hardwick/8/head 2025-12-04T09:43:32.6407361Z * [new branch] gh/robert-hardwick/8/orig -> origin/gh/robert-hardwick/8/orig 2025-12-04T09:43:32.6409672Z * [new branch] gh/robert-hardwick/9/base -> origin/gh/robert-hardwick/9/base 2025-12-04T09:43:32.6411445Z * [new branch] gh/robert-hardwick/9/head -> origin/gh/robert-hardwick/9/head 2025-12-04T09:43:32.6413154Z * [new branch] gh/robert-hardwick/9/orig -> origin/gh/robert-hardwick/9/orig 2025-12-04T09:43:32.6415952Z * [new branch] gh/rtimpe/1/base -> origin/gh/rtimpe/1/base 2025-12-04T09:43:32.6417646Z * [new branch] gh/rtimpe/1/head -> origin/gh/rtimpe/1/head 2025-12-04T09:43:32.6419910Z * [new branch] gh/rtimpe/2/base -> origin/gh/rtimpe/2/base 2025-12-04T09:43:32.6421594Z * [new branch] gh/rtimpe/2/head -> origin/gh/rtimpe/2/head 2025-12-04T09:43:32.6423879Z * [new branch] gh/rtimpe/22/base -> origin/gh/rtimpe/22/base 2025-12-04T09:43:32.6425604Z * [new 
branch] gh/rtimpe/22/head -> origin/gh/rtimpe/22/head 2025-12-04T09:43:32.6427390Z * [new branch] gh/rtimpe/22/orig -> origin/gh/rtimpe/22/orig 2025-12-04T09:43:32.6429659Z * [new branch] gh/rtimpe/23/base -> origin/gh/rtimpe/23/base 2025-12-04T09:43:32.6431439Z * [new branch] gh/rtimpe/23/head -> origin/gh/rtimpe/23/head 2025-12-04T09:43:32.6433076Z * [new branch] gh/rtimpe/23/orig -> origin/gh/rtimpe/23/orig 2025-12-04T09:43:32.6435337Z * [new branch] gh/rtimpe/24/base -> origin/gh/rtimpe/24/base 2025-12-04T09:43:32.6436984Z * [new branch] gh/rtimpe/24/head -> origin/gh/rtimpe/24/head 2025-12-04T09:43:32.6438732Z * [new branch] gh/rtimpe/24/orig -> origin/gh/rtimpe/24/orig 2025-12-04T09:43:32.6441037Z * [new branch] gh/rtimpe/25/base -> origin/gh/rtimpe/25/base 2025-12-04T09:43:32.6442731Z * [new branch] gh/rtimpe/25/head -> origin/gh/rtimpe/25/head 2025-12-04T09:43:32.6444463Z * [new branch] gh/rtimpe/25/orig -> origin/gh/rtimpe/25/orig 2025-12-04T09:43:32.6446743Z * [new branch] gh/rtimpe/26/base -> origin/gh/rtimpe/26/base 2025-12-04T09:43:32.6448503Z * [new branch] gh/rtimpe/26/head -> origin/gh/rtimpe/26/head 2025-12-04T09:43:32.6450278Z * [new branch] gh/rtimpe/26/orig -> origin/gh/rtimpe/26/orig 2025-12-04T09:43:32.6453483Z * [new branch] gh/rtimpe/27/base -> origin/gh/rtimpe/27/base 2025-12-04T09:43:32.6455331Z * [new branch] gh/rtimpe/27/head -> origin/gh/rtimpe/27/head 2025-12-04T09:43:32.6457151Z * [new branch] gh/rtimpe/27/orig -> origin/gh/rtimpe/27/orig 2025-12-04T09:43:32.6459419Z * [new branch] gh/rtimpe/28/base -> origin/gh/rtimpe/28/base 2025-12-04T09:43:32.6461013Z * [new branch] gh/rtimpe/28/head -> origin/gh/rtimpe/28/head 2025-12-04T09:43:32.6462754Z * [new branch] gh/rtimpe/28/orig -> origin/gh/rtimpe/28/orig 2025-12-04T09:43:32.6465027Z * [new branch] gh/rtimpe/29/base -> origin/gh/rtimpe/29/base 2025-12-04T09:43:32.6466736Z * [new branch] gh/rtimpe/29/head -> origin/gh/rtimpe/29/head 2025-12-04T09:43:32.6468650Z * [new branch] 
gh/rtimpe/29/orig -> origin/gh/rtimpe/29/orig 2025-12-04T09:43:32.6471201Z * [new branch] gh/rtimpe/3/base -> origin/gh/rtimpe/3/base 2025-12-04T09:43:32.6472902Z * [new branch] gh/rtimpe/3/head -> origin/gh/rtimpe/3/head 2025-12-04T09:43:32.6475231Z * [new branch] gh/rtimpe/30/base -> origin/gh/rtimpe/30/base 2025-12-04T09:43:32.6476941Z * [new branch] gh/rtimpe/30/head -> origin/gh/rtimpe/30/head 2025-12-04T09:43:32.6478639Z * [new branch] gh/rtimpe/30/orig -> origin/gh/rtimpe/30/orig 2025-12-04T09:43:32.6481287Z * [new branch] gh/rtimpe/31/base -> origin/gh/rtimpe/31/base 2025-12-04T09:43:32.6483023Z * [new branch] gh/rtimpe/31/head -> origin/gh/rtimpe/31/head 2025-12-04T09:43:32.6484793Z * [new branch] gh/rtimpe/31/orig -> origin/gh/rtimpe/31/orig 2025-12-04T09:43:32.6487124Z * [new branch] gh/rtimpe/32/base -> origin/gh/rtimpe/32/base 2025-12-04T09:43:32.6488827Z * [new branch] gh/rtimpe/32/head -> origin/gh/rtimpe/32/head 2025-12-04T09:43:32.6490534Z * [new branch] gh/rtimpe/32/orig -> origin/gh/rtimpe/32/orig 2025-12-04T09:43:32.6493375Z * [new branch] gh/rtimpe/33/base -> origin/gh/rtimpe/33/base 2025-12-04T09:43:32.6495108Z * [new branch] gh/rtimpe/33/head -> origin/gh/rtimpe/33/head 2025-12-04T09:43:32.6496926Z * [new branch] gh/rtimpe/33/orig -> origin/gh/rtimpe/33/orig 2025-12-04T09:43:32.6499113Z * [new branch] gh/rtimpe/34/base -> origin/gh/rtimpe/34/base 2025-12-04T09:43:32.6500818Z * [new branch] gh/rtimpe/34/head -> origin/gh/rtimpe/34/head 2025-12-04T09:43:32.6502775Z * [new branch] gh/rtimpe/34/orig -> origin/gh/rtimpe/34/orig 2025-12-04T09:43:32.6504953Z * [new branch] gh/rtimpe/35/base -> origin/gh/rtimpe/35/base 2025-12-04T09:43:32.6506692Z * [new branch] gh/rtimpe/35/head -> origin/gh/rtimpe/35/head 2025-12-04T09:43:32.6508551Z * [new branch] gh/rtimpe/35/orig -> origin/gh/rtimpe/35/orig 2025-12-04T09:43:32.6510782Z * [new branch] gh/rtimpe/4/base -> origin/gh/rtimpe/4/base 2025-12-04T09:43:32.6512516Z * [new branch] gh/rtimpe/4/head -> 
origin/gh/rtimpe/4/head 2025-12-04T09:43:32.6515375Z * [new branch] gh/ruisizhang123/1/base -> origin/gh/ruisizhang123/1/base 2025-12-04T09:43:32.6517163Z * [new branch] gh/ruisizhang123/1/head -> origin/gh/ruisizhang123/1/head 2025-12-04T09:43:32.6518873Z * [new branch] gh/ruisizhang123/1/orig -> origin/gh/ruisizhang123/1/orig 2025-12-04T09:43:32.6521163Z * [new branch] gh/ruisizhang123/4/base -> origin/gh/ruisizhang123/4/base 2025-12-04T09:43:32.6522863Z * [new branch] gh/ruisizhang123/4/head -> origin/gh/ruisizhang123/4/head 2025-12-04T09:43:32.6524539Z * [new branch] gh/ruisizhang123/4/orig -> origin/gh/ruisizhang123/4/orig 2025-12-04T09:43:32.6526938Z * [new branch] gh/ruisizhang123/5/base -> origin/gh/ruisizhang123/5/base 2025-12-04T09:43:32.6528642Z * [new branch] gh/ruisizhang123/5/head -> origin/gh/ruisizhang123/5/head 2025-12-04T09:43:32.6530311Z * [new branch] gh/ruisizhang123/5/orig -> origin/gh/ruisizhang123/5/orig 2025-12-04T09:43:32.6532606Z * [new branch] gh/ruisizhang123/6/base -> origin/gh/ruisizhang123/6/base 2025-12-04T09:43:32.6534270Z * [new branch] gh/ruisizhang123/6/head -> origin/gh/ruisizhang123/6/head 2025-12-04T09:43:32.6536020Z * [new branch] gh/ruisizhang123/6/orig -> origin/gh/ruisizhang123/6/orig 2025-12-04T09:43:32.6538413Z * [new branch] gh/ruisizhang123/7/base -> origin/gh/ruisizhang123/7/base 2025-12-04T09:43:32.6540091Z * [new branch] gh/ruisizhang123/7/head -> origin/gh/ruisizhang123/7/head 2025-12-04T09:43:32.6541750Z * [new branch] gh/ruisizhang123/7/orig -> origin/gh/ruisizhang123/7/orig 2025-12-04T09:43:32.6543986Z * [new branch] gh/ruisizhang123/8/base -> origin/gh/ruisizhang123/8/base 2025-12-04T09:43:32.6545674Z * [new branch] gh/ruisizhang123/8/head -> origin/gh/ruisizhang123/8/head 2025-12-04T09:43:32.6547883Z * [new branch] gh/ruisizhang123/8/orig -> origin/gh/ruisizhang123/8/orig 2025-12-04T09:43:32.6550278Z * [new branch] gh/ruisizhang123/9/base -> origin/gh/ruisizhang123/9/base 2025-12-04T09:43:32.6552071Z * [new 
branch] gh/ruisizhang123/9/head -> origin/gh/ruisizhang123/9/head 2025-12-04T09:43:32.6553750Z * [new branch] gh/ruisizhang123/9/orig -> origin/gh/ruisizhang123/9/orig 2025-12-04T09:43:32.6558244Z * [new branch] gh/seemethere/52/base -> origin/gh/seemethere/52/base 2025-12-04T09:43:32.6559965Z * [new branch] gh/seemethere/52/head -> origin/gh/seemethere/52/head 2025-12-04T09:43:32.6561862Z * [new branch] gh/seemethere/52/orig -> origin/gh/seemethere/52/orig 2025-12-04T09:43:32.6564132Z * [new branch] gh/seemethere/53/base -> origin/gh/seemethere/53/base 2025-12-04T09:43:32.6565839Z * [new branch] gh/seemethere/53/head -> origin/gh/seemethere/53/head 2025-12-04T09:43:32.6567529Z * [new branch] gh/seemethere/53/orig -> origin/gh/seemethere/53/orig 2025-12-04T09:43:32.6569866Z * [new branch] gh/seemethere/54/base -> origin/gh/seemethere/54/base 2025-12-04T09:43:32.6571603Z * [new branch] gh/seemethere/54/head -> origin/gh/seemethere/54/head 2025-12-04T09:43:32.6573452Z * [new branch] gh/seemethere/54/orig -> origin/gh/seemethere/54/orig 2025-12-04T09:43:32.6575610Z * [new branch] gh/seemethere/55/base -> origin/gh/seemethere/55/base 2025-12-04T09:43:32.6577225Z * [new branch] gh/seemethere/55/head -> origin/gh/seemethere/55/head 2025-12-04T09:43:32.6578986Z * [new branch] gh/seemethere/55/orig -> origin/gh/seemethere/55/orig 2025-12-04T09:43:32.6581255Z * [new branch] gh/seemethere/59/base -> origin/gh/seemethere/59/base 2025-12-04T09:43:32.6582895Z * [new branch] gh/seemethere/59/head -> origin/gh/seemethere/59/head 2025-12-04T09:43:32.6584692Z * [new branch] gh/seemethere/59/orig -> origin/gh/seemethere/59/orig 2025-12-04T09:43:32.6586984Z * [new branch] gh/seemethere/62/base -> origin/gh/seemethere/62/base 2025-12-04T09:43:32.6588893Z * [new branch] gh/seemethere/62/head -> origin/gh/seemethere/62/head 2025-12-04T09:43:32.6590649Z * [new branch] gh/seemethere/62/orig -> origin/gh/seemethere/62/orig 2025-12-04T09:43:32.6592929Z * [new branch] gh/seemethere/63/base 
-> origin/gh/seemethere/63/base 2025-12-04T09:43:32.6594648Z * [new branch] gh/seemethere/63/head -> origin/gh/seemethere/63/head 2025-12-04T09:43:32.6596362Z * [new branch] gh/seemethere/63/orig -> origin/gh/seemethere/63/orig 2025-12-04T09:43:32.6598629Z * [new branch] gh/seemethere/71/base -> origin/gh/seemethere/71/base 2025-12-04T09:43:32.6600298Z * [new branch] gh/seemethere/71/head -> origin/gh/seemethere/71/head 2025-12-04T09:43:32.6601928Z * [new branch] gh/seemethere/71/orig -> origin/gh/seemethere/71/orig 2025-12-04T09:43:32.6604245Z * [new branch] gh/seemethere/72/base -> origin/gh/seemethere/72/base 2025-12-04T09:43:32.6605994Z * [new branch] gh/seemethere/72/head -> origin/gh/seemethere/72/head 2025-12-04T09:43:32.6607815Z * [new branch] gh/seemethere/72/orig -> origin/gh/seemethere/72/orig 2025-12-04T09:43:32.6610085Z * [new branch] gh/seemethere/73/base -> origin/gh/seemethere/73/base 2025-12-04T09:43:32.6611887Z * [new branch] gh/seemethere/73/head -> origin/gh/seemethere/73/head 2025-12-04T09:43:32.6613563Z * [new branch] gh/seemethere/73/orig -> origin/gh/seemethere/73/orig 2025-12-04T09:43:32.6615850Z * [new branch] gh/seemethere/74/base -> origin/gh/seemethere/74/base 2025-12-04T09:43:32.6617521Z * [new branch] gh/seemethere/74/head -> origin/gh/seemethere/74/head 2025-12-04T09:43:32.6619273Z * [new branch] gh/seemethere/74/orig -> origin/gh/seemethere/74/orig 2025-12-04T09:43:32.6621633Z * [new branch] gh/seemethere/75/base -> origin/gh/seemethere/75/base 2025-12-04T09:43:32.6623274Z * [new branch] gh/seemethere/75/head -> origin/gh/seemethere/75/head 2025-12-04T09:43:32.6624989Z * [new branch] gh/seemethere/75/orig -> origin/gh/seemethere/75/orig 2025-12-04T09:43:32.6627310Z * [new branch] gh/seemethere/76/base -> origin/gh/seemethere/76/base 2025-12-04T09:43:32.6629209Z * [new branch] gh/seemethere/76/head -> origin/gh/seemethere/76/head 2025-12-04T09:43:32.6630949Z * [new branch] gh/seemethere/76/orig -> origin/gh/seemethere/76/orig 
2025-12-04T09:43:32.6633880Z * [new branch] gh/shunting314/145/base -> origin/gh/shunting314/145/base 2025-12-04T09:43:32.6635675Z * [new branch] gh/shunting314/145/head -> origin/gh/shunting314/145/head 2025-12-04T09:43:32.6637458Z * [new branch] gh/shunting314/145/orig -> origin/gh/shunting314/145/orig 2025-12-04T09:43:32.6640025Z * [new branch] gh/shunting314/176/base -> origin/gh/shunting314/176/base 2025-12-04T09:43:32.6641779Z * [new branch] gh/shunting314/176/head -> origin/gh/shunting314/176/head 2025-12-04T09:43:32.6643514Z * [new branch] gh/shunting314/176/orig -> origin/gh/shunting314/176/orig 2025-12-04T09:43:32.6645849Z * [new branch] gh/shunting314/249/base -> origin/gh/shunting314/249/base 2025-12-04T09:43:32.6647658Z * [new branch] gh/shunting314/249/head -> origin/gh/shunting314/249/head 2025-12-04T09:43:32.6649429Z * [new branch] gh/shunting314/249/orig -> origin/gh/shunting314/249/orig 2025-12-04T09:43:32.6651766Z * [new branch] gh/shunting314/253/base -> origin/gh/shunting314/253/base 2025-12-04T09:43:32.6653490Z * [new branch] gh/shunting314/253/head -> origin/gh/shunting314/253/head 2025-12-04T09:43:32.6655359Z * [new branch] gh/shunting314/253/orig -> origin/gh/shunting314/253/orig 2025-12-04T09:43:32.6657865Z * [new branch] gh/shunting314/256/base -> origin/gh/shunting314/256/base 2025-12-04T09:43:32.6659545Z * [new branch] gh/shunting314/256/head -> origin/gh/shunting314/256/head 2025-12-04T09:43:32.6661254Z * [new branch] gh/shunting314/256/orig -> origin/gh/shunting314/256/orig 2025-12-04T09:43:32.6663838Z * [new branch] gh/shunting314/257/base -> origin/gh/shunting314/257/base 2025-12-04T09:43:32.6665628Z * [new branch] gh/shunting314/257/head -> origin/gh/shunting314/257/head 2025-12-04T09:43:32.6667373Z * [new branch] gh/shunting314/257/orig -> origin/gh/shunting314/257/orig 2025-12-04T09:43:32.6669965Z * [new branch] gh/shunting314/258/base -> origin/gh/shunting314/258/base 2025-12-04T09:43:32.6671612Z * [new branch] 
gh/shunting314/258/head -> origin/gh/shunting314/258/head 2025-12-04T09:43:32.6673381Z * [new branch] gh/shunting314/258/orig -> origin/gh/shunting314/258/orig 2025-12-04T09:43:32.6675584Z * [new branch] gh/shunting314/259/base -> origin/gh/shunting314/259/base 2025-12-04T09:43:32.6677377Z * [new branch] gh/shunting314/259/head -> origin/gh/shunting314/259/head 2025-12-04T09:43:32.6679076Z * [new branch] gh/shunting314/259/orig -> origin/gh/shunting314/259/orig 2025-12-04T09:43:32.6681491Z * [new branch] gh/shunting314/260/base -> origin/gh/shunting314/260/base 2025-12-04T09:43:32.6683375Z * [new branch] gh/shunting314/260/head -> origin/gh/shunting314/260/head 2025-12-04T09:43:32.6685059Z * [new branch] gh/shunting314/260/orig -> origin/gh/shunting314/260/orig 2025-12-04T09:43:32.6687386Z * [new branch] gh/shunting314/261/base -> origin/gh/shunting314/261/base 2025-12-04T09:43:32.6689699Z * [new branch] gh/shunting314/261/head -> origin/gh/shunting314/261/head 2025-12-04T09:43:32.6691511Z * [new branch] gh/shunting314/261/orig -> origin/gh/shunting314/261/orig 2025-12-04T09:43:32.6693906Z * [new branch] gh/shunting314/262/base -> origin/gh/shunting314/262/base 2025-12-04T09:43:32.6695714Z * [new branch] gh/shunting314/262/head -> origin/gh/shunting314/262/head 2025-12-04T09:43:32.6697473Z * [new branch] gh/shunting314/262/orig -> origin/gh/shunting314/262/orig 2025-12-04T09:43:32.6699869Z * [new branch] gh/shunting314/263/base -> origin/gh/shunting314/263/base 2025-12-04T09:43:32.6701715Z * [new branch] gh/shunting314/263/head -> origin/gh/shunting314/263/head 2025-12-04T09:43:32.6703459Z * [new branch] gh/shunting314/263/orig -> origin/gh/shunting314/263/orig 2025-12-04T09:43:32.6705790Z * [new branch] gh/shunting314/264/base -> origin/gh/shunting314/264/base 2025-12-04T09:43:32.6707882Z * [new branch] gh/shunting314/264/head -> origin/gh/shunting314/264/head 2025-12-04T09:43:32.6709441Z * [new branch] gh/shunting314/264/orig -> origin/gh/shunting314/264/orig 
2025-12-04T09:43:32.6711728Z * [new branch] gh/shunting314/265/base -> origin/gh/shunting314/265/base 2025-12-04T09:43:32.6713376Z * [new branch] gh/shunting314/265/head -> origin/gh/shunting314/265/head 2025-12-04T09:43:32.6715117Z * [new branch] gh/shunting314/265/orig -> origin/gh/shunting314/265/orig 2025-12-04T09:43:32.6717546Z * [new branch] gh/shunting314/266/base -> origin/gh/shunting314/266/base 2025-12-04T09:43:32.6719356Z * [new branch] gh/shunting314/266/head -> origin/gh/shunting314/266/head 2025-12-04T09:43:32.6721058Z * [new branch] gh/shunting314/266/orig -> origin/gh/shunting314/266/orig 2025-12-04T09:43:32.6723551Z * [new branch] gh/shunting314/267/base -> origin/gh/shunting314/267/base 2025-12-04T09:43:32.6725428Z * [new branch] gh/shunting314/267/head -> origin/gh/shunting314/267/head 2025-12-04T09:43:32.6727567Z * [new branch] gh/shunting314/267/orig -> origin/gh/shunting314/267/orig 2025-12-04T09:43:32.6730326Z * [new branch] gh/shunting314/268/base -> origin/gh/shunting314/268/base 2025-12-04T09:43:32.6732143Z * [new branch] gh/shunting314/268/head -> origin/gh/shunting314/268/head 2025-12-04T09:43:32.6733881Z * [new branch] gh/shunting314/268/orig -> origin/gh/shunting314/268/orig 2025-12-04T09:43:32.6736257Z * [new branch] gh/shunting314/269/base -> origin/gh/shunting314/269/base 2025-12-04T09:43:32.6737970Z * [new branch] gh/shunting314/269/head -> origin/gh/shunting314/269/head 2025-12-04T09:43:32.6739667Z * [new branch] gh/shunting314/269/orig -> origin/gh/shunting314/269/orig 2025-12-04T09:43:32.6742467Z * [new branch] gh/silverguo/1/base -> origin/gh/silverguo/1/base 2025-12-04T09:43:32.6744177Z * [new branch] gh/silverguo/1/head -> origin/gh/silverguo/1/head 2025-12-04T09:43:32.6746350Z * [new branch] gh/silverguo/2/base -> origin/gh/silverguo/2/base 2025-12-04T09:43:32.6748183Z * [new branch] gh/silverguo/2/head -> origin/gh/silverguo/2/head 2025-12-04T09:43:32.6750336Z * [new branch] gh/silverguo/3/base -> origin/gh/silverguo/3/base 
2025-12-04T09:43:32.6752008Z * [new branch] gh/silverguo/3/head -> origin/gh/silverguo/3/head 2025-12-04T09:43:32.6754145Z * [new branch] gh/silverguo/4/base -> origin/gh/silverguo/4/base 2025-12-04T09:43:32.6755947Z * [new branch] gh/silverguo/4/head -> origin/gh/silverguo/4/head 2025-12-04T09:43:32.6758837Z * [new branch] gh/slayton58/39/base -> origin/gh/slayton58/39/base 2025-12-04T09:43:32.6760856Z * [new branch] gh/slayton58/39/head -> origin/gh/slayton58/39/head 2025-12-04T09:43:32.6762641Z * [new branch] gh/slayton58/39/orig -> origin/gh/slayton58/39/orig 2025-12-04T09:43:32.6764885Z * [new branch] gh/slayton58/42/base -> origin/gh/slayton58/42/base 2025-12-04T09:43:32.6766564Z * [new branch] gh/slayton58/42/head -> origin/gh/slayton58/42/head 2025-12-04T09:43:32.6768354Z * [new branch] gh/slayton58/42/orig -> origin/gh/slayton58/42/orig 2025-12-04T09:43:32.6770951Z * [new branch] gh/slayton58/43/base -> origin/gh/slayton58/43/base 2025-12-04T09:43:32.6772670Z * [new branch] gh/slayton58/43/head -> origin/gh/slayton58/43/head 2025-12-04T09:43:32.6774676Z * [new branch] gh/slayton58/43/orig -> origin/gh/slayton58/43/orig 2025-12-04T09:43:32.6777158Z * [new branch] gh/slayton58/44/base -> origin/gh/slayton58/44/base 2025-12-04T09:43:32.6779659Z * [new branch] gh/slayton58/44/head -> origin/gh/slayton58/44/head 2025-12-04T09:43:32.6781284Z * [new branch] gh/slayton58/44/orig -> origin/gh/slayton58/44/orig 2025-12-04T09:43:32.6783504Z * [new branch] gh/slayton58/45/base -> origin/gh/slayton58/45/base 2025-12-04T09:43:32.6785176Z * [new branch] gh/slayton58/45/head -> origin/gh/slayton58/45/head 2025-12-04T09:43:32.6787018Z * [new branch] gh/slayton58/45/orig -> origin/gh/slayton58/45/orig 2025-12-04T09:43:32.6789937Z * [new branch] gh/slayton58/46/base -> origin/gh/slayton58/46/base 2025-12-04T09:43:32.6791701Z * [new branch] gh/slayton58/46/head -> origin/gh/slayton58/46/head 2025-12-04T09:43:32.6793476Z * [new branch] gh/slayton58/46/orig -> 
origin/gh/slayton58/46/orig 2025-12-04T09:43:32.6795930Z * [new branch] gh/slayton58/6/base -> origin/gh/slayton58/6/base 2025-12-04T09:43:32.6797661Z * [new branch] gh/slayton58/6/head -> origin/gh/slayton58/6/head 2025-12-04T09:43:32.6799787Z * [new branch] gh/slayton58/7/base -> origin/gh/slayton58/7/base 2025-12-04T09:43:32.6801445Z * [new branch] gh/slayton58/7/head -> origin/gh/slayton58/7/head 2025-12-04T09:43:32.6804428Z * [new branch] gh/soulitzer/269/base -> origin/gh/soulitzer/269/base 2025-12-04T09:43:32.6806043Z * [new branch] gh/soulitzer/269/head -> origin/gh/soulitzer/269/head 2025-12-04T09:43:32.6807816Z * [new branch] gh/soulitzer/269/orig -> origin/gh/soulitzer/269/orig 2025-12-04T09:43:32.6810177Z * [new branch] gh/soulitzer/276/base -> origin/gh/soulitzer/276/base 2025-12-04T09:43:32.6811931Z * [new branch] gh/soulitzer/276/head -> origin/gh/soulitzer/276/head 2025-12-04T09:43:32.6813599Z * [new branch] gh/soulitzer/276/orig -> origin/gh/soulitzer/276/orig 2025-12-04T09:43:32.6816260Z * [new branch] gh/soulitzer/287/base -> origin/gh/soulitzer/287/base 2025-12-04T09:43:32.6818330Z * [new branch] gh/soulitzer/287/head -> origin/gh/soulitzer/287/head 2025-12-04T09:43:32.6820047Z * [new branch] gh/soulitzer/287/orig -> origin/gh/soulitzer/287/orig 2025-12-04T09:43:32.6822473Z * [new branch] gh/soulitzer/296/base -> origin/gh/soulitzer/296/base 2025-12-04T09:43:32.6824289Z * [new branch] gh/soulitzer/296/head -> origin/gh/soulitzer/296/head 2025-12-04T09:43:32.6826228Z * [new branch] gh/soulitzer/296/orig -> origin/gh/soulitzer/296/orig 2025-12-04T09:43:32.6828481Z * [new branch] gh/soulitzer/299/base -> origin/gh/soulitzer/299/base 2025-12-04T09:43:32.6830225Z * [new branch] gh/soulitzer/299/head -> origin/gh/soulitzer/299/head 2025-12-04T09:43:32.6832023Z * [new branch] gh/soulitzer/299/orig -> origin/gh/soulitzer/299/orig 2025-12-04T09:43:32.6834444Z * [new branch] gh/soulitzer/300/base -> origin/gh/soulitzer/300/base 
2025-12-04T09:43:32.6836163Z * [new branch] gh/soulitzer/300/head -> origin/gh/soulitzer/300/head 2025-12-04T09:43:32.6837866Z * [new branch] gh/soulitzer/300/orig -> origin/gh/soulitzer/300/orig 2025-12-04T09:43:32.6840367Z * [new branch] gh/soulitzer/301/base -> origin/gh/soulitzer/301/base 2025-12-04T09:43:32.6842177Z * [new branch] gh/soulitzer/301/head -> origin/gh/soulitzer/301/head 2025-12-04T09:43:32.6843840Z * [new branch] gh/soulitzer/301/orig -> origin/gh/soulitzer/301/orig 2025-12-04T09:43:32.6846131Z * [new branch] gh/soulitzer/313/base -> origin/gh/soulitzer/313/base 2025-12-04T09:43:32.6847845Z * [new branch] gh/soulitzer/313/head -> origin/gh/soulitzer/313/head 2025-12-04T09:43:32.6849724Z * [new branch] gh/soulitzer/313/orig -> origin/gh/soulitzer/313/orig 2025-12-04T09:43:32.6851982Z * [new branch] gh/soulitzer/319/base -> origin/gh/soulitzer/319/base 2025-12-04T09:43:32.6853690Z * [new branch] gh/soulitzer/319/head -> origin/gh/soulitzer/319/head 2025-12-04T09:43:32.6855486Z * [new branch] gh/soulitzer/319/orig -> origin/gh/soulitzer/319/orig 2025-12-04T09:43:32.6858023Z * [new branch] gh/soulitzer/320/base -> origin/gh/soulitzer/320/base 2025-12-04T09:43:32.6859592Z * [new branch] gh/soulitzer/320/head -> origin/gh/soulitzer/320/head 2025-12-04T09:43:32.6861227Z * [new branch] gh/soulitzer/320/orig -> origin/gh/soulitzer/320/orig 2025-12-04T09:43:32.6863726Z * [new branch] gh/soulitzer/336/base -> origin/gh/soulitzer/336/base 2025-12-04T09:43:32.6865382Z * [new branch] gh/soulitzer/336/head -> origin/gh/soulitzer/336/head 2025-12-04T09:43:32.6867052Z * [new branch] gh/soulitzer/336/orig -> origin/gh/soulitzer/336/orig 2025-12-04T09:43:32.6869989Z * [new branch] gh/soulitzer/347/base -> origin/gh/soulitzer/347/base 2025-12-04T09:43:32.6871632Z * [new branch] gh/soulitzer/347/head -> origin/gh/soulitzer/347/head 2025-12-04T09:43:32.6873366Z * [new branch] gh/soulitzer/347/orig -> origin/gh/soulitzer/347/orig 2025-12-04T09:43:32.6875785Z * [new 
branch] gh/soulitzer/349/base -> origin/gh/soulitzer/349/base 2025-12-04T09:43:32.6877580Z * [new branch] gh/soulitzer/349/head -> origin/gh/soulitzer/349/head 2025-12-04T09:43:32.6879311Z * [new branch] gh/soulitzer/349/orig -> origin/gh/soulitzer/349/orig 2025-12-04T09:43:32.6881528Z * [new branch] gh/soulitzer/350/base -> origin/gh/soulitzer/350/base 2025-12-04T09:43:32.6883131Z * [new branch] gh/soulitzer/350/head -> origin/gh/soulitzer/350/head 2025-12-04T09:43:32.6884845Z * [new branch] gh/soulitzer/350/orig -> origin/gh/soulitzer/350/orig 2025-12-04T09:43:32.6887223Z * [new branch] gh/soulitzer/351/base -> origin/gh/soulitzer/351/base 2025-12-04T09:43:32.6889006Z * [new branch] gh/soulitzer/351/head -> origin/gh/soulitzer/351/head 2025-12-04T09:43:32.6890684Z * [new branch] gh/soulitzer/351/orig -> origin/gh/soulitzer/351/orig 2025-12-04T09:43:32.6893018Z * [new branch] gh/soulitzer/353/base -> origin/gh/soulitzer/353/base 2025-12-04T09:43:32.6894796Z * [new branch] gh/soulitzer/353/head -> origin/gh/soulitzer/353/head 2025-12-04T09:43:32.6896538Z * [new branch] gh/soulitzer/353/orig -> origin/gh/soulitzer/353/orig 2025-12-04T09:43:32.6899509Z * [new branch] gh/soulitzer/358/base -> origin/gh/soulitzer/358/base 2025-12-04T09:43:32.6901306Z * [new branch] gh/soulitzer/358/head -> origin/gh/soulitzer/358/head 2025-12-04T09:43:32.6902988Z * [new branch] gh/soulitzer/358/orig -> origin/gh/soulitzer/358/orig 2025-12-04T09:43:32.6905850Z * [new branch] gh/soulitzer/359/base -> origin/gh/soulitzer/359/base 2025-12-04T09:43:32.6907668Z * [new branch] gh/soulitzer/359/head -> origin/gh/soulitzer/359/head 2025-12-04T09:43:32.6909434Z * [new branch] gh/soulitzer/359/orig -> origin/gh/soulitzer/359/orig 2025-12-04T09:43:32.6911885Z * [new branch] gh/soulitzer/374/base -> origin/gh/soulitzer/374/base 2025-12-04T09:43:32.6913754Z * [new branch] gh/soulitzer/374/head -> origin/gh/soulitzer/374/head 2025-12-04T09:43:32.6915473Z * [new branch] gh/soulitzer/374/orig -> 
origin/gh/soulitzer/374/orig 2025-12-04T09:43:32.6917860Z * [new branch] gh/soulitzer/375/base -> origin/gh/soulitzer/375/base 2025-12-04T09:43:32.6919622Z * [new branch] gh/soulitzer/375/head -> origin/gh/soulitzer/375/head 2025-12-04T09:43:32.6921277Z * [new branch] gh/soulitzer/375/orig -> origin/gh/soulitzer/375/orig 2025-12-04T09:43:32.6923638Z * [new branch] gh/soulitzer/380/base -> origin/gh/soulitzer/380/base 2025-12-04T09:43:32.6925389Z * [new branch] gh/soulitzer/380/head -> origin/gh/soulitzer/380/head 2025-12-04T09:43:32.6927089Z * [new branch] gh/soulitzer/380/orig -> origin/gh/soulitzer/380/orig 2025-12-04T09:43:32.6929408Z * [new branch] gh/soulitzer/385/base -> origin/gh/soulitzer/385/base 2025-12-04T09:43:32.6931125Z * [new branch] gh/soulitzer/385/head -> origin/gh/soulitzer/385/head 2025-12-04T09:43:32.6932825Z * [new branch] gh/soulitzer/385/orig -> origin/gh/soulitzer/385/orig 2025-12-04T09:43:32.6935258Z * [new branch] gh/soulitzer/386/base -> origin/gh/soulitzer/386/base 2025-12-04T09:43:32.6936938Z * [new branch] gh/soulitzer/386/head -> origin/gh/soulitzer/386/head 2025-12-04T09:43:32.6938725Z * [new branch] gh/soulitzer/386/orig -> origin/gh/soulitzer/386/orig 2025-12-04T09:43:32.6941146Z * [new branch] gh/soulitzer/387/base -> origin/gh/soulitzer/387/base 2025-12-04T09:43:32.6942857Z * [new branch] gh/soulitzer/387/head -> origin/gh/soulitzer/387/head 2025-12-04T09:43:32.6944562Z * [new branch] gh/soulitzer/387/orig -> origin/gh/soulitzer/387/orig 2025-12-04T09:43:32.6946906Z * [new branch] gh/soulitzer/388/base -> origin/gh/soulitzer/388/base 2025-12-04T09:43:32.6948752Z * [new branch] gh/soulitzer/388/head -> origin/gh/soulitzer/388/head 2025-12-04T09:43:32.6950489Z * [new branch] gh/soulitzer/388/orig -> origin/gh/soulitzer/388/orig 2025-12-04T09:43:32.6952767Z * [new branch] gh/soulitzer/389/base -> origin/gh/soulitzer/389/base 2025-12-04T09:43:32.6954466Z * [new branch] gh/soulitzer/389/head -> origin/gh/soulitzer/389/head 
2025-12-04T09:43:32.6956498Z * [new branch] gh/soulitzer/389/orig -> origin/gh/soulitzer/389/orig 2025-12-04T09:43:32.6958927Z * [new branch] gh/soulitzer/390/base -> origin/gh/soulitzer/390/base 2025-12-04T09:43:32.6960533Z * [new branch] gh/soulitzer/390/head -> origin/gh/soulitzer/390/head 2025-12-04T09:43:32.6962323Z * [new branch] gh/soulitzer/390/orig -> origin/gh/soulitzer/390/orig 2025-12-04T09:43:32.6964654Z * [new branch] gh/soulitzer/391/base -> origin/gh/soulitzer/391/base 2025-12-04T09:43:32.6966401Z * [new branch] gh/soulitzer/391/head -> origin/gh/soulitzer/391/head 2025-12-04T09:43:32.6968076Z * [new branch] gh/soulitzer/391/orig -> origin/gh/soulitzer/391/orig 2025-12-04T09:43:32.6970439Z * [new branch] gh/soulitzer/392/base -> origin/gh/soulitzer/392/base 2025-12-04T09:43:32.6972113Z * [new branch] gh/soulitzer/392/head -> origin/gh/soulitzer/392/head 2025-12-04T09:43:32.6973852Z * [new branch] gh/soulitzer/392/orig -> origin/gh/soulitzer/392/orig 2025-12-04T09:43:32.6976582Z * [new branch] gh/swolchok/728/next -> origin/gh/swolchok/728/next 2025-12-04T09:43:32.6979119Z * [new branch] gh/swolchok/819/base -> origin/gh/swolchok/819/base 2025-12-04T09:43:32.6980930Z * [new branch] gh/swolchok/819/head -> origin/gh/swolchok/819/head 2025-12-04T09:43:32.6982759Z * [new branch] gh/swolchok/819/orig -> origin/gh/swolchok/819/orig 2025-12-04T09:43:32.6985030Z * [new branch] gh/swolchok/824/base -> origin/gh/swolchok/824/base 2025-12-04T09:43:32.6986865Z * [new branch] gh/swolchok/824/head -> origin/gh/swolchok/824/head 2025-12-04T09:43:32.6988609Z * [new branch] gh/swolchok/824/orig -> origin/gh/swolchok/824/orig 2025-12-04T09:43:32.6990931Z * [new branch] gh/swolchok/829/base -> origin/gh/swolchok/829/base 2025-12-04T09:43:32.6992524Z * [new branch] gh/swolchok/829/head -> origin/gh/swolchok/829/head 2025-12-04T09:43:32.6994346Z * [new branch] gh/swolchok/829/orig -> origin/gh/swolchok/829/orig 2025-12-04T09:43:32.6996720Z * [new branch] 
gh/swolchok/839/base -> origin/gh/swolchok/839/base 2025-12-04T09:43:32.6998333Z * [new branch] gh/swolchok/839/head -> origin/gh/swolchok/839/head 2025-12-04T09:43:32.7000056Z * [new branch] gh/swolchok/839/orig -> origin/gh/swolchok/839/orig 2025-12-04T09:43:32.7002373Z * [new branch] gh/swolchok/841/base -> origin/gh/swolchok/841/base 2025-12-04T09:43:32.7004183Z * [new branch] gh/swolchok/841/head -> origin/gh/swolchok/841/head 2025-12-04T09:43:32.7005988Z * [new branch] gh/swolchok/841/orig -> origin/gh/swolchok/841/orig 2025-12-04T09:43:32.7008277Z * [new branch] gh/swolchok/842/base -> origin/gh/swolchok/842/base 2025-12-04T09:43:32.7010036Z * [new branch] gh/swolchok/842/head -> origin/gh/swolchok/842/head 2025-12-04T09:43:32.7011696Z * [new branch] gh/swolchok/842/orig -> origin/gh/swolchok/842/orig 2025-12-04T09:43:32.7014016Z * [new branch] gh/swolchok/845/base -> origin/gh/swolchok/845/base 2025-12-04T09:43:32.7015709Z * [new branch] gh/swolchok/845/head -> origin/gh/swolchok/845/head 2025-12-04T09:43:32.7017457Z * [new branch] gh/swolchok/845/orig -> origin/gh/swolchok/845/orig 2025-12-04T09:43:32.7019789Z * [new branch] gh/swolchok/848/base -> origin/gh/swolchok/848/base 2025-12-04T09:43:32.7022089Z * [new branch] gh/swolchok/848/head -> origin/gh/swolchok/848/head 2025-12-04T09:43:32.7023843Z * [new branch] gh/swolchok/848/orig -> origin/gh/swolchok/848/orig 2025-12-04T09:43:32.7026108Z * [new branch] gh/swolchok/856/base -> origin/gh/swolchok/856/base 2025-12-04T09:43:32.7027993Z * [new branch] gh/swolchok/856/head -> origin/gh/swolchok/856/head 2025-12-04T09:43:32.7029706Z * [new branch] gh/swolchok/856/orig -> origin/gh/swolchok/856/orig 2025-12-04T09:43:32.7032063Z * [new branch] gh/swolchok/860/base -> origin/gh/swolchok/860/base 2025-12-04T09:43:32.7033786Z * [new branch] gh/swolchok/860/head -> origin/gh/swolchok/860/head 2025-12-04T09:43:32.7035432Z * [new branch] gh/swolchok/860/orig -> origin/gh/swolchok/860/orig 
2025-12-04T09:43:32.7037897Z * [new branch] gh/swolchok/861/base -> origin/gh/swolchok/861/base 2025-12-04T09:43:32.7039652Z * [new branch] gh/swolchok/861/head -> origin/gh/swolchok/861/head 2025-12-04T09:43:32.7041365Z * [new branch] gh/swolchok/861/orig -> origin/gh/swolchok/861/orig 2025-12-04T09:43:32.7043772Z * [new branch] gh/swolchok/862/base -> origin/gh/swolchok/862/base 2025-12-04T09:43:32.7045386Z * [new branch] gh/swolchok/862/head -> origin/gh/swolchok/862/head 2025-12-04T09:43:32.7047055Z * [new branch] gh/swolchok/862/orig -> origin/gh/swolchok/862/orig 2025-12-04T09:43:32.7049663Z * [new branch] gh/swolchok/863/base -> origin/gh/swolchok/863/base 2025-12-04T09:43:32.7051448Z * [new branch] gh/swolchok/863/head -> origin/gh/swolchok/863/head 2025-12-04T09:43:32.7053241Z * [new branch] gh/swolchok/863/orig -> origin/gh/swolchok/863/orig 2025-12-04T09:43:32.7055868Z * [new branch] gh/swolchok/864/base -> origin/gh/swolchok/864/base 2025-12-04T09:43:32.7058873Z * [new branch] gh/swolchok/864/head -> origin/gh/swolchok/864/head 2025-12-04T09:43:32.7060611Z * [new branch] gh/swolchok/864/orig -> origin/gh/swolchok/864/orig 2025-12-04T09:43:32.7062960Z * [new branch] gh/swolchok/865/base -> origin/gh/swolchok/865/base 2025-12-04T09:43:32.7064834Z * [new branch] gh/swolchok/865/head -> origin/gh/swolchok/865/head 2025-12-04T09:43:32.7066599Z * [new branch] gh/swolchok/865/orig -> origin/gh/swolchok/865/orig 2025-12-04T09:43:32.7069623Z * [new branch] gh/swolchok/866/base -> origin/gh/swolchok/866/base 2025-12-04T09:43:32.7071416Z * [new branch] gh/swolchok/866/head -> origin/gh/swolchok/866/head 2025-12-04T09:43:32.7073128Z * [new branch] gh/swolchok/866/orig -> origin/gh/swolchok/866/orig 2025-12-04T09:43:32.7075524Z * [new branch] gh/swolchok/867/base -> origin/gh/swolchok/867/base 2025-12-04T09:43:32.7077339Z * [new branch] gh/swolchok/867/head -> origin/gh/swolchok/867/head 2025-12-04T09:43:32.7079001Z * [new branch] gh/swolchok/867/orig -> 
origin/gh/swolchok/867/orig 2025-12-04T09:43:32.7081293Z * [new branch] gh/swolchok/868/base -> origin/gh/swolchok/868/base 2025-12-04T09:43:32.7082991Z * [new branch] gh/swolchok/868/head -> origin/gh/swolchok/868/head 2025-12-04T09:43:32.7084783Z * [new branch] gh/swolchok/868/orig -> origin/gh/swolchok/868/orig 2025-12-04T09:43:32.7087137Z * [new branch] gh/swolchok/869/base -> origin/gh/swolchok/869/base 2025-12-04T09:43:32.7088914Z * [new branch] gh/swolchok/869/head -> origin/gh/swolchok/869/head 2025-12-04T09:43:32.7090652Z * [new branch] gh/swolchok/869/orig -> origin/gh/swolchok/869/orig 2025-12-04T09:43:32.7093587Z * [new branch] gh/swolchok/870/base -> origin/gh/swolchok/870/base 2025-12-04T09:43:32.7095249Z * [new branch] gh/swolchok/870/head -> origin/gh/swolchok/870/head 2025-12-04T09:43:32.7096965Z * [new branch] gh/swolchok/870/orig -> origin/gh/swolchok/870/orig 2025-12-04T09:43:32.7099383Z * [new branch] gh/swolchok/871/base -> origin/gh/swolchok/871/base 2025-12-04T09:43:32.7101286Z * [new branch] gh/swolchok/871/head -> origin/gh/swolchok/871/head 2025-12-04T09:43:32.7103045Z * [new branch] gh/swolchok/871/orig -> origin/gh/swolchok/871/orig 2025-12-04T09:43:32.7106011Z * [new branch] gh/teja-rao/4/base -> origin/gh/teja-rao/4/base 2025-12-04T09:43:32.7107826Z * [new branch] gh/teja-rao/4/head -> origin/gh/teja-rao/4/head 2025-12-04T09:43:32.7109602Z * [new branch] gh/teja-rao/4/orig -> origin/gh/teja-rao/4/orig 2025-12-04T09:43:32.7112324Z * [new branch] gh/tianyu-l/2/base -> origin/gh/tianyu-l/2/base 2025-12-04T09:43:32.7114043Z * [new branch] gh/tianyu-l/2/head -> origin/gh/tianyu-l/2/head 2025-12-04T09:43:32.7115855Z * [new branch] gh/tianyu-l/2/orig -> origin/gh/tianyu-l/2/orig 2025-12-04T09:43:32.7118208Z * [new branch] gh/tianyu-l/3/base -> origin/gh/tianyu-l/3/base 2025-12-04T09:43:32.7120007Z * [new branch] gh/tianyu-l/3/orig -> origin/gh/tianyu-l/3/orig 2025-12-04T09:43:32.7122410Z * [new branch] gh/tianyu-l/4/base -> 
origin/gh/tianyu-l/4/base 2025-12-04T09:43:32.7124112Z * [new branch] gh/tianyu-l/4/head -> origin/gh/tianyu-l/4/head 2025-12-04T09:43:32.7125816Z * [new branch] gh/tianyu-l/4/orig -> origin/gh/tianyu-l/4/orig 2025-12-04T09:43:32.7129011Z * [new branch] gh/tugsbayasgalan/10/base -> origin/gh/tugsbayasgalan/10/base 2025-12-04T09:43:32.7130673Z * [new branch] gh/tugsbayasgalan/10/head -> origin/gh/tugsbayasgalan/10/head 2025-12-04T09:43:32.7132435Z * [new branch] gh/tugsbayasgalan/10/orig -> origin/gh/tugsbayasgalan/10/orig 2025-12-04T09:43:32.7135222Z * [new branch] gh/tugsbayasgalan/13/base -> origin/gh/tugsbayasgalan/13/base 2025-12-04T09:43:32.7136922Z * [new branch] gh/tugsbayasgalan/13/head -> origin/gh/tugsbayasgalan/13/head 2025-12-04T09:43:32.7138639Z * [new branch] gh/tugsbayasgalan/13/orig -> origin/gh/tugsbayasgalan/13/orig 2025-12-04T09:43:32.7140936Z * [new branch] gh/tugsbayasgalan/17/base -> origin/gh/tugsbayasgalan/17/base 2025-12-04T09:43:32.7142625Z * [new branch] gh/tugsbayasgalan/17/head -> origin/gh/tugsbayasgalan/17/head 2025-12-04T09:43:32.7144340Z * [new branch] gh/tugsbayasgalan/17/orig -> origin/gh/tugsbayasgalan/17/orig 2025-12-04T09:43:32.7146807Z * [new branch] gh/tugsbayasgalan/2/base -> origin/gh/tugsbayasgalan/2/base 2025-12-04T09:43:32.7148649Z * [new branch] gh/tugsbayasgalan/2/head -> origin/gh/tugsbayasgalan/2/head 2025-12-04T09:43:32.7150305Z * [new branch] gh/tugsbayasgalan/2/orig -> origin/gh/tugsbayasgalan/2/orig 2025-12-04T09:43:32.7152917Z * [new branch] gh/tugsbayasgalan/28/base -> origin/gh/tugsbayasgalan/28/base 2025-12-04T09:43:32.7154567Z * [new branch] gh/tugsbayasgalan/28/head -> origin/gh/tugsbayasgalan/28/head 2025-12-04T09:43:32.7156599Z * [new branch] gh/tugsbayasgalan/28/orig -> origin/gh/tugsbayasgalan/28/orig 2025-12-04T09:43:32.7158893Z * [new branch] gh/tugsbayasgalan/32/base -> origin/gh/tugsbayasgalan/32/base 2025-12-04T09:43:32.7160564Z * [new branch] gh/tugsbayasgalan/32/head -> 
origin/gh/tugsbayasgalan/32/head 2025-12-04T09:43:32.7162349Z * [new branch] gh/tugsbayasgalan/32/orig -> origin/gh/tugsbayasgalan/32/orig 2025-12-04T09:43:32.7165190Z * [new branch] gh/tugsbayasgalan/35/base -> origin/gh/tugsbayasgalan/35/base 2025-12-04T09:43:32.7166995Z * [new branch] gh/tugsbayasgalan/35/head -> origin/gh/tugsbayasgalan/35/head 2025-12-04T09:43:32.7168736Z * [new branch] gh/tugsbayasgalan/35/orig -> origin/gh/tugsbayasgalan/35/orig 2025-12-04T09:43:32.7171221Z * [new branch] gh/tugsbayasgalan/36/base -> origin/gh/tugsbayasgalan/36/base 2025-12-04T09:43:32.7172971Z * [new branch] gh/tugsbayasgalan/36/head -> origin/gh/tugsbayasgalan/36/head 2025-12-04T09:43:32.7174648Z * [new branch] gh/tugsbayasgalan/36/orig -> origin/gh/tugsbayasgalan/36/orig 2025-12-04T09:43:32.7177482Z * [new branch] gh/tugsbayasgalan/37/base -> origin/gh/tugsbayasgalan/37/base 2025-12-04T09:43:32.7179243Z * [new branch] gh/tugsbayasgalan/37/head -> origin/gh/tugsbayasgalan/37/head 2025-12-04T09:43:32.7180882Z * [new branch] gh/tugsbayasgalan/37/orig -> origin/gh/tugsbayasgalan/37/orig 2025-12-04T09:43:32.7183235Z * [new branch] gh/tugsbayasgalan/43/base -> origin/gh/tugsbayasgalan/43/base 2025-12-04T09:43:32.7184926Z * [new branch] gh/tugsbayasgalan/43/head -> origin/gh/tugsbayasgalan/43/head 2025-12-04T09:43:32.7186650Z * [new branch] gh/tugsbayasgalan/43/orig -> origin/gh/tugsbayasgalan/43/orig 2025-12-04T09:43:32.7188969Z * [new branch] gh/tugsbayasgalan/48/base -> origin/gh/tugsbayasgalan/48/base 2025-12-04T09:43:32.7190712Z * [new branch] gh/tugsbayasgalan/48/head -> origin/gh/tugsbayasgalan/48/head 2025-12-04T09:43:32.7192397Z * [new branch] gh/tugsbayasgalan/48/orig -> origin/gh/tugsbayasgalan/48/orig 2025-12-04T09:43:32.7194835Z * [new branch] gh/tugsbayasgalan/51/base -> origin/gh/tugsbayasgalan/51/base 2025-12-04T09:43:32.7196723Z * [new branch] gh/tugsbayasgalan/51/head -> origin/gh/tugsbayasgalan/51/head 2025-12-04T09:43:32.7198401Z * [new branch] 
gh/tugsbayasgalan/51/orig -> origin/gh/tugsbayasgalan/51/orig 2025-12-04T09:43:32.7200628Z * [new branch] gh/tugsbayasgalan/52/base -> origin/gh/tugsbayasgalan/52/base 2025-12-04T09:43:32.7202384Z * [new branch] gh/tugsbayasgalan/52/head -> origin/gh/tugsbayasgalan/52/head 2025-12-04T09:43:32.7204108Z * [new branch] gh/tugsbayasgalan/52/orig -> origin/gh/tugsbayasgalan/52/orig 2025-12-04T09:43:32.7206457Z * [new branch] gh/tugsbayasgalan/53/base -> origin/gh/tugsbayasgalan/53/base 2025-12-04T09:43:32.7208182Z * [new branch] gh/tugsbayasgalan/53/head -> origin/gh/tugsbayasgalan/53/head 2025-12-04T09:43:32.7209934Z * [new branch] gh/tugsbayasgalan/53/orig -> origin/gh/tugsbayasgalan/53/orig 2025-12-04T09:43:32.7212292Z * [new branch] gh/tugsbayasgalan/55/base -> origin/gh/tugsbayasgalan/55/base 2025-12-04T09:43:32.7214651Z * [new branch] gh/tugsbayasgalan/55/head -> origin/gh/tugsbayasgalan/55/head 2025-12-04T09:43:32.7216399Z * [new branch] gh/tugsbayasgalan/55/orig -> origin/gh/tugsbayasgalan/55/orig 2025-12-04T09:43:32.7219026Z * [new branch] gh/tugsbayasgalan/59/base -> origin/gh/tugsbayasgalan/59/base 2025-12-04T09:43:32.7220844Z * [new branch] gh/tugsbayasgalan/59/head -> origin/gh/tugsbayasgalan/59/head 2025-12-04T09:43:32.7222582Z * [new branch] gh/tugsbayasgalan/59/orig -> origin/gh/tugsbayasgalan/59/orig 2025-12-04T09:43:32.7224864Z * [new branch] gh/tugsbayasgalan/6/base -> origin/gh/tugsbayasgalan/6/base 2025-12-04T09:43:32.7226573Z * [new branch] gh/tugsbayasgalan/6/head -> origin/gh/tugsbayasgalan/6/head 2025-12-04T09:43:32.7228405Z * [new branch] gh/tugsbayasgalan/6/orig -> origin/gh/tugsbayasgalan/6/orig 2025-12-04T09:43:32.7230659Z * [new branch] gh/tugsbayasgalan/60/base -> origin/gh/tugsbayasgalan/60/base 2025-12-04T09:43:32.7232416Z * [new branch] gh/tugsbayasgalan/60/head -> origin/gh/tugsbayasgalan/60/head 2025-12-04T09:43:32.7234128Z * [new branch] gh/tugsbayasgalan/60/orig -> origin/gh/tugsbayasgalan/60/orig 2025-12-04T09:43:32.7236942Z * [new 
branch] gh/tugsbayasgalan/61/base -> origin/gh/tugsbayasgalan/61/base 2025-12-04T09:43:32.7238622Z * [new branch] gh/tugsbayasgalan/61/head -> origin/gh/tugsbayasgalan/61/head 2025-12-04T09:43:32.7240222Z * [new branch] gh/tugsbayasgalan/61/orig -> origin/gh/tugsbayasgalan/61/orig 2025-12-04T09:43:32.7242784Z * [new branch] gh/tugsbayasgalan/63/base -> origin/gh/tugsbayasgalan/63/base 2025-12-04T09:43:32.7244446Z * [new branch] gh/tugsbayasgalan/63/head -> origin/gh/tugsbayasgalan/63/head 2025-12-04T09:43:32.7246169Z * [new branch] gh/tugsbayasgalan/63/orig -> origin/gh/tugsbayasgalan/63/orig 2025-12-04T09:43:32.7248496Z * [new branch] gh/tugsbayasgalan/67/base -> origin/gh/tugsbayasgalan/67/base 2025-12-04T09:43:32.7250343Z * [new branch] gh/tugsbayasgalan/67/head -> origin/gh/tugsbayasgalan/67/head 2025-12-04T09:43:32.7252035Z * [new branch] gh/tugsbayasgalan/67/orig -> origin/gh/tugsbayasgalan/67/orig 2025-12-04T09:43:32.7254541Z * [new branch] gh/tugsbayasgalan/68/base -> origin/gh/tugsbayasgalan/68/base 2025-12-04T09:43:32.7257098Z * [new branch] gh/tugsbayasgalan/68/head -> origin/gh/tugsbayasgalan/68/head 2025-12-04T09:43:32.7259165Z * [new branch] gh/tugsbayasgalan/68/orig -> origin/gh/tugsbayasgalan/68/orig 2025-12-04T09:43:32.7261945Z * [new branch] gh/tugsbayasgalan/7/base -> origin/gh/tugsbayasgalan/7/base 2025-12-04T09:43:32.7263657Z * [new branch] gh/tugsbayasgalan/7/head -> origin/gh/tugsbayasgalan/7/head 2025-12-04T09:43:32.7265562Z * [new branch] gh/tugsbayasgalan/7/orig -> origin/gh/tugsbayasgalan/7/orig 2025-12-04T09:43:32.7268234Z * [new branch] gh/tugsbayasgalan/70/base -> origin/gh/tugsbayasgalan/70/base 2025-12-04T09:43:32.7270099Z * [new branch] gh/tugsbayasgalan/70/head -> origin/gh/tugsbayasgalan/70/head 2025-12-04T09:43:32.7271729Z * [new branch] gh/tugsbayasgalan/70/orig -> origin/gh/tugsbayasgalan/70/orig 2025-12-04T09:43:32.7274399Z * [new branch] gh/tugsbayasgalan/71/base -> origin/gh/tugsbayasgalan/71/base 
2025-12-04T09:43:32.7276275Z * [new branch] gh/tugsbayasgalan/71/head -> origin/gh/tugsbayasgalan/71/head 2025-12-04T09:43:32.7278040Z * [new branch] gh/tugsbayasgalan/71/orig -> origin/gh/tugsbayasgalan/71/orig 2025-12-04T09:43:32.7280561Z * [new branch] gh/tugsbayasgalan/72/base -> origin/gh/tugsbayasgalan/72/base 2025-12-04T09:43:32.7282283Z * [new branch] gh/tugsbayasgalan/72/head -> origin/gh/tugsbayasgalan/72/head 2025-12-04T09:43:32.7284007Z * [new branch] gh/tugsbayasgalan/72/orig -> origin/gh/tugsbayasgalan/72/orig 2025-12-04T09:43:32.7286393Z * [new branch] gh/tugsbayasgalan/73/base -> origin/gh/tugsbayasgalan/73/base 2025-12-04T09:43:32.7288252Z * [new branch] gh/tugsbayasgalan/73/head -> origin/gh/tugsbayasgalan/73/head 2025-12-04T09:43:32.7290036Z * [new branch] gh/tugsbayasgalan/73/orig -> origin/gh/tugsbayasgalan/73/orig 2025-12-04T09:43:32.7292716Z * [new branch] gh/tugsbayasgalan/74/base -> origin/gh/tugsbayasgalan/74/base 2025-12-04T09:43:32.7294477Z * [new branch] gh/tugsbayasgalan/74/head -> origin/gh/tugsbayasgalan/74/head 2025-12-04T09:43:32.7296242Z * [new branch] gh/tugsbayasgalan/74/orig -> origin/gh/tugsbayasgalan/74/orig 2025-12-04T09:43:32.7298667Z * [new branch] gh/tugsbayasgalan/75/base -> origin/gh/tugsbayasgalan/75/base 2025-12-04T09:43:32.7300329Z * [new branch] gh/tugsbayasgalan/75/head -> origin/gh/tugsbayasgalan/75/head 2025-12-04T09:43:32.7302074Z * [new branch] gh/tugsbayasgalan/75/orig -> origin/gh/tugsbayasgalan/75/orig 2025-12-04T09:43:32.7304323Z * [new branch] gh/tugsbayasgalan/76/base -> origin/gh/tugsbayasgalan/76/base 2025-12-04T09:43:32.7306095Z * [new branch] gh/tugsbayasgalan/76/head -> origin/gh/tugsbayasgalan/76/head 2025-12-04T09:43:32.7307917Z * [new branch] gh/tugsbayasgalan/76/orig -> origin/gh/tugsbayasgalan/76/orig 2025-12-04T09:43:32.7310441Z * [new branch] gh/tugsbayasgalan/77/base -> origin/gh/tugsbayasgalan/77/base 2025-12-04T09:43:32.7312132Z * [new branch] gh/tugsbayasgalan/77/head -> 
origin/gh/tugsbayasgalan/77/head 2025-12-04T09:43:32.7313863Z * [new branch] gh/tugsbayasgalan/77/orig -> origin/gh/tugsbayasgalan/77/orig 2025-12-04T09:43:32.7316352Z * [new branch] gh/tugsbayasgalan/78/base -> origin/gh/tugsbayasgalan/78/base 2025-12-04T09:43:32.7318202Z * [new branch] gh/tugsbayasgalan/78/head -> origin/gh/tugsbayasgalan/78/head 2025-12-04T09:43:32.7319888Z * [new branch] gh/tugsbayasgalan/78/orig -> origin/gh/tugsbayasgalan/78/orig 2025-12-04T09:43:32.7322841Z * [new branch] gh/tugsbayasgalan/79/base -> origin/gh/tugsbayasgalan/79/base 2025-12-04T09:43:32.7324528Z * [new branch] gh/tugsbayasgalan/79/head -> origin/gh/tugsbayasgalan/79/head 2025-12-04T09:43:32.7326218Z * [new branch] gh/tugsbayasgalan/79/orig -> origin/gh/tugsbayasgalan/79/orig 2025-12-04T09:43:32.7328807Z * [new branch] gh/tugsbayasgalan/8/base -> origin/gh/tugsbayasgalan/8/base 2025-12-04T09:43:32.7330562Z * [new branch] gh/tugsbayasgalan/8/head -> origin/gh/tugsbayasgalan/8/head 2025-12-04T09:43:32.7332666Z * [new branch] gh/tugsbayasgalan/8/orig -> origin/gh/tugsbayasgalan/8/orig 2025-12-04T09:43:32.7334689Z * [new branch] gh/tugsbayasgalan/80/base -> origin/gh/tugsbayasgalan/80/base 2025-12-04T09:43:32.7336385Z * [new branch] gh/tugsbayasgalan/80/head -> origin/gh/tugsbayasgalan/80/head 2025-12-04T09:43:32.7338070Z * [new branch] gh/tugsbayasgalan/80/orig -> origin/gh/tugsbayasgalan/80/orig 2025-12-04T09:43:32.7340445Z * [new branch] gh/tugsbayasgalan/81/base -> origin/gh/tugsbayasgalan/81/base 2025-12-04T09:43:32.7354396Z * [new branch] gh/tugsbayasgalan/81/head -> origin/gh/tugsbayasgalan/81/head 2025-12-04T09:43:32.7354771Z * [new branch] gh/tugsbayasgalan/81/orig -> origin/gh/tugsbayasgalan/81/orig 2025-12-04T09:43:32.7354970Z * [new branch] gh/tugsbayasgalan/82/base -> origin/gh/tugsbayasgalan/82/base 2025-12-04T09:43:32.7355162Z * [new branch] gh/tugsbayasgalan/82/head -> origin/gh/tugsbayasgalan/82/head 2025-12-04T09:43:32.7355532Z * [new branch] 
gh/tugsbayasgalan/82/orig -> origin/gh/tugsbayasgalan/82/orig 2025-12-04T09:43:32.7355859Z * [new branch] gh/tugsbayasgalan/83/base -> origin/gh/tugsbayasgalan/83/base 2025-12-04T09:43:32.7356151Z * [new branch] gh/tugsbayasgalan/83/head -> origin/gh/tugsbayasgalan/83/head 2025-12-04T09:43:32.7356344Z * [new branch] gh/tugsbayasgalan/83/orig -> origin/gh/tugsbayasgalan/83/orig 2025-12-04T09:43:32.7358596Z * [new branch] gh/tugsbayasgalan/84/base -> origin/gh/tugsbayasgalan/84/base 2025-12-04T09:43:32.7360305Z * [new branch] gh/tugsbayasgalan/84/head -> origin/gh/tugsbayasgalan/84/head 2025-12-04T09:43:32.7361982Z * [new branch] gh/tugsbayasgalan/84/orig -> origin/gh/tugsbayasgalan/84/orig 2025-12-04T09:43:32.7364267Z * [new branch] gh/tugsbayasgalan/85/base -> origin/gh/tugsbayasgalan/85/base 2025-12-04T09:43:32.7366029Z * [new branch] gh/tugsbayasgalan/85/head -> origin/gh/tugsbayasgalan/85/head 2025-12-04T09:43:32.7367791Z * [new branch] gh/tugsbayasgalan/85/orig -> origin/gh/tugsbayasgalan/85/orig 2025-12-04T09:43:32.7370112Z * [new branch] gh/tugsbayasgalan/86/base -> origin/gh/tugsbayasgalan/86/base 2025-12-04T09:43:32.7371917Z * [new branch] gh/tugsbayasgalan/86/head -> origin/gh/tugsbayasgalan/86/head 2025-12-04T09:43:32.7373587Z * [new branch] gh/tugsbayasgalan/86/orig -> origin/gh/tugsbayasgalan/86/orig 2025-12-04T09:43:32.7376228Z * [new branch] gh/tugsbayasgalan/87/base -> origin/gh/tugsbayasgalan/87/base 2025-12-04T09:43:32.7378359Z * [new branch] gh/tugsbayasgalan/87/head -> origin/gh/tugsbayasgalan/87/head 2025-12-04T09:43:32.7380071Z * [new branch] gh/tugsbayasgalan/87/orig -> origin/gh/tugsbayasgalan/87/orig 2025-12-04T09:43:32.7382634Z * [new branch] gh/tugsbayasgalan/88/base -> origin/gh/tugsbayasgalan/88/base 2025-12-04T09:43:32.7384306Z * [new branch] gh/tugsbayasgalan/88/head -> origin/gh/tugsbayasgalan/88/head 2025-12-04T09:43:32.7386176Z * [new branch] gh/tugsbayasgalan/88/orig -> origin/gh/tugsbayasgalan/88/orig 2025-12-04T09:43:32.7388739Z 
* [new branch] gh/tugsbayasgalan/89/base -> origin/gh/tugsbayasgalan/89/base 2025-12-04T09:43:32.7390472Z * [new branch] gh/tugsbayasgalan/89/head -> origin/gh/tugsbayasgalan/89/head 2025-12-04T09:43:32.7392231Z * [new branch] gh/tugsbayasgalan/89/orig -> origin/gh/tugsbayasgalan/89/orig 2025-12-04T09:43:32.7394608Z * [new branch] gh/tugsbayasgalan/9/base -> origin/gh/tugsbayasgalan/9/base 2025-12-04T09:43:32.7396191Z * [new branch] gh/tugsbayasgalan/9/head -> origin/gh/tugsbayasgalan/9/head 2025-12-04T09:43:32.7397869Z * [new branch] gh/tugsbayasgalan/9/orig -> origin/gh/tugsbayasgalan/9/orig 2025-12-04T09:43:32.7400571Z * [new branch] gh/tugsbayasgalan/90/base -> origin/gh/tugsbayasgalan/90/base 2025-12-04T09:43:32.7402148Z * [new branch] gh/tugsbayasgalan/90/head -> origin/gh/tugsbayasgalan/90/head 2025-12-04T09:43:32.7403750Z * [new branch] gh/tugsbayasgalan/90/orig -> origin/gh/tugsbayasgalan/90/orig 2025-12-04T09:43:32.7406162Z * [new branch] gh/tugsbayasgalan/91/base -> origin/gh/tugsbayasgalan/91/base 2025-12-04T09:43:32.7407927Z * [new branch] gh/tugsbayasgalan/91/head -> origin/gh/tugsbayasgalan/91/head 2025-12-04T09:43:32.7409563Z * [new branch] gh/tugsbayasgalan/91/orig -> origin/gh/tugsbayasgalan/91/orig 2025-12-04T09:43:32.7412058Z * [new branch] gh/tugsbayasgalan/92/base -> origin/gh/tugsbayasgalan/92/base 2025-12-04T09:43:32.7413794Z * [new branch] gh/tugsbayasgalan/92/head -> origin/gh/tugsbayasgalan/92/head 2025-12-04T09:43:32.7415533Z * [new branch] gh/tugsbayasgalan/92/orig -> origin/gh/tugsbayasgalan/92/orig 2025-12-04T09:43:32.7418068Z * [new branch] gh/tugsbayasgalan/93/base -> origin/gh/tugsbayasgalan/93/base 2025-12-04T09:43:32.7419778Z * [new branch] gh/tugsbayasgalan/93/head -> origin/gh/tugsbayasgalan/93/head 2025-12-04T09:43:32.7421857Z * [new branch] gh/tugsbayasgalan/93/orig -> origin/gh/tugsbayasgalan/93/orig 2025-12-04T09:43:32.7424862Z * [new branch] gh/v0i0/14/base -> origin/gh/v0i0/14/base 2025-12-04T09:43:32.7426471Z * [new 
branch] gh/v0i0/14/head -> origin/gh/v0i0/14/head 2025-12-04T09:43:32.7428231Z * [new branch] gh/v0i0/14/orig -> origin/gh/v0i0/14/orig 2025-12-04T09:43:32.7430495Z * [new branch] gh/v0i0/15/base -> origin/gh/v0i0/15/base 2025-12-04T09:43:32.7432296Z * [new branch] gh/v0i0/15/head -> origin/gh/v0i0/15/head 2025-12-04T09:43:32.7434088Z * [new branch] gh/v0i0/15/orig -> origin/gh/v0i0/15/orig 2025-12-04T09:43:32.7436429Z * [new branch] gh/v0i0/16/base -> origin/gh/v0i0/16/base 2025-12-04T09:43:32.7438174Z * [new branch] gh/v0i0/16/head -> origin/gh/v0i0/16/head 2025-12-04T09:43:32.7439850Z * [new branch] gh/v0i0/16/orig -> origin/gh/v0i0/16/orig 2025-12-04T09:43:32.7442164Z * [new branch] gh/v0i0/17/base -> origin/gh/v0i0/17/base 2025-12-04T09:43:32.7443810Z * [new branch] gh/v0i0/17/head -> origin/gh/v0i0/17/head 2025-12-04T09:43:32.7445469Z * [new branch] gh/v0i0/17/orig -> origin/gh/v0i0/17/orig 2025-12-04T09:43:32.7447840Z * [new branch] gh/v0i0/18/base -> origin/gh/v0i0/18/base 2025-12-04T09:43:32.7449550Z * [new branch] gh/v0i0/18/head -> origin/gh/v0i0/18/head 2025-12-04T09:43:32.7451654Z * [new branch] gh/v0i0/18/orig -> origin/gh/v0i0/18/orig 2025-12-04T09:43:32.7453978Z * [new branch] gh/v0i0/19/base -> origin/gh/v0i0/19/base 2025-12-04T09:43:32.7455904Z * [new branch] gh/v0i0/19/head -> origin/gh/v0i0/19/head 2025-12-04T09:43:32.7457665Z * [new branch] gh/v0i0/19/orig -> origin/gh/v0i0/19/orig 2025-12-04T09:43:32.7460536Z * [new branch] gh/vishal9-team/1/base -> origin/gh/vishal9-team/1/base 2025-12-04T09:43:32.7462223Z * [new branch] gh/vishal9-team/1/head -> origin/gh/vishal9-team/1/head 2025-12-04T09:43:32.7464390Z * [new branch] gh/vishal9-team/2/base -> origin/gh/vishal9-team/2/base 2025-12-04T09:43:32.7466105Z * [new branch] gh/vishal9-team/2/head -> origin/gh/vishal9-team/2/head 2025-12-04T09:43:32.7467944Z * [new branch] gh/vishal9-team/2/orig -> origin/gh/vishal9-team/2/orig 2025-12-04T09:43:32.7470377Z * [new branch] gh/vishal9-team/3/base -> 
origin/gh/vishal9-team/3/base 2025-12-04T09:43:32.7472123Z * [new branch] gh/vishal9-team/3/head -> origin/gh/vishal9-team/3/head 2025-12-04T09:43:32.7473794Z * [new branch] gh/vishal9-team/3/orig -> origin/gh/vishal9-team/3/orig 2025-12-04T09:43:32.7475965Z * [new branch] gh/vishal9-team/4/base -> origin/gh/vishal9-team/4/base 2025-12-04T09:43:32.7477675Z * [new branch] gh/vishal9-team/4/head -> origin/gh/vishal9-team/4/head 2025-12-04T09:43:32.7479489Z * [new branch] gh/vishal9-team/4/orig -> origin/gh/vishal9-team/4/orig 2025-12-04T09:43:32.7482261Z * [new branch] gh/vkuzo/1/next -> origin/gh/vkuzo/1/next 2025-12-04T09:43:32.7484552Z * [new branch] gh/vkuzo/2/next -> origin/gh/vkuzo/2/next 2025-12-04T09:43:32.7486869Z * [new branch] gh/vkuzo/3/next -> origin/gh/vkuzo/3/next 2025-12-04T09:43:32.7490284Z * [new branch] gh/wconstab/424/base -> origin/gh/wconstab/424/base 2025-12-04T09:43:32.7492136Z * [new branch] gh/wconstab/424/head -> origin/gh/wconstab/424/head 2025-12-04T09:43:32.7493944Z * [new branch] gh/wconstab/424/orig -> origin/gh/wconstab/424/orig 2025-12-04T09:43:32.7496304Z * [new branch] gh/wconstab/435/base -> origin/gh/wconstab/435/base 2025-12-04T09:43:32.7498091Z * [new branch] gh/wconstab/435/head -> origin/gh/wconstab/435/head 2025-12-04T09:43:32.7500053Z * [new branch] gh/wconstab/435/orig -> origin/gh/wconstab/435/orig 2025-12-04T09:43:32.7502538Z * [new branch] gh/wconstab/444/base -> origin/gh/wconstab/444/base 2025-12-04T09:43:32.7504312Z * [new branch] gh/wconstab/444/head -> origin/gh/wconstab/444/head 2025-12-04T09:43:32.7506057Z * [new branch] gh/wconstab/444/orig -> origin/gh/wconstab/444/orig 2025-12-04T09:43:32.7508538Z * [new branch] gh/wconstab/447/base -> origin/gh/wconstab/447/base 2025-12-04T09:43:32.7510412Z * [new branch] gh/wconstab/447/head -> origin/gh/wconstab/447/head 2025-12-04T09:43:32.7512245Z * [new branch] gh/wconstab/447/orig -> origin/gh/wconstab/447/orig 2025-12-04T09:43:32.7514593Z * [new branch] 
gh/wconstab/448/base -> origin/gh/wconstab/448/base 2025-12-04T09:43:32.7516306Z * [new branch] gh/wconstab/448/head -> origin/gh/wconstab/448/head 2025-12-04T09:43:32.7518004Z * [new branch] gh/wconstab/448/orig -> origin/gh/wconstab/448/orig 2025-12-04T09:43:32.7520224Z * [new branch] gh/wconstab/449/base -> origin/gh/wconstab/449/base 2025-12-04T09:43:32.7521911Z * [new branch] gh/wconstab/449/head -> origin/gh/wconstab/449/head 2025-12-04T09:43:32.7523711Z * [new branch] gh/wconstab/449/orig -> origin/gh/wconstab/449/orig 2025-12-04T09:43:32.7525911Z * [new branch] gh/wconstab/450/base -> origin/gh/wconstab/450/base 2025-12-04T09:43:32.7527712Z * [new branch] gh/wconstab/450/head -> origin/gh/wconstab/450/head 2025-12-04T09:43:32.7529417Z * [new branch] gh/wconstab/450/orig -> origin/gh/wconstab/450/orig 2025-12-04T09:43:32.7531598Z * [new branch] gh/wconstab/451/base -> origin/gh/wconstab/451/base 2025-12-04T09:43:32.7533528Z * [new branch] gh/wconstab/451/head -> origin/gh/wconstab/451/head 2025-12-04T09:43:32.7535259Z * [new branch] gh/wconstab/451/orig -> origin/gh/wconstab/451/orig 2025-12-04T09:43:32.7537629Z * [new branch] gh/wconstab/452/base -> origin/gh/wconstab/452/base 2025-12-04T09:43:32.7539284Z * [new branch] gh/wconstab/452/head -> origin/gh/wconstab/452/head 2025-12-04T09:43:32.7541043Z * [new branch] gh/wconstab/452/orig -> origin/gh/wconstab/452/orig 2025-12-04T09:43:32.7543220Z * [new branch] gh/wconstab/453/base -> origin/gh/wconstab/453/base 2025-12-04T09:43:32.7546454Z * [new branch] gh/wconstab/453/head -> origin/gh/wconstab/453/head 2025-12-04T09:43:32.7547382Z * [new branch] gh/wconstab/453/orig -> origin/gh/wconstab/453/orig 2025-12-04T09:43:32.7549295Z * [new branch] gh/wconstab/454/base -> origin/gh/wconstab/454/base 2025-12-04T09:43:32.7551067Z * [new branch] gh/wconstab/454/head -> origin/gh/wconstab/454/head 2025-12-04T09:43:32.7552694Z * [new branch] gh/wconstab/454/orig -> origin/gh/wconstab/454/orig 
2025-12-04T09:43:32.7554995Z * [new branch] gh/wconstab/455/base -> origin/gh/wconstab/455/base 2025-12-04T09:43:32.7558307Z * [new branch] gh/wconstab/455/head -> origin/gh/wconstab/455/head 2025-12-04T09:43:32.7560092Z * [new branch] gh/wconstab/455/orig -> origin/gh/wconstab/455/orig 2025-12-04T09:43:32.7562530Z * [new branch] gh/wconstab/456/base -> origin/gh/wconstab/456/base 2025-12-04T09:43:32.7564457Z * [new branch] gh/wconstab/456/head -> origin/gh/wconstab/456/head 2025-12-04T09:43:32.7566252Z * [new branch] gh/wconstab/456/orig -> origin/gh/wconstab/456/orig 2025-12-04T09:43:32.7568625Z * [new branch] gh/wconstab/457/base -> origin/gh/wconstab/457/base 2025-12-04T09:43:32.7570388Z * [new branch] gh/wconstab/457/head -> origin/gh/wconstab/457/head 2025-12-04T09:43:32.7572117Z * [new branch] gh/wconstab/457/orig -> origin/gh/wconstab/457/orig 2025-12-04T09:43:32.7574872Z * [new branch] gh/wconstab/458/base -> origin/gh/wconstab/458/base 2025-12-04T09:43:32.7576672Z * [new branch] gh/wconstab/458/head -> origin/gh/wconstab/458/head 2025-12-04T09:43:32.7578811Z * [new branch] gh/wconstab/458/orig -> origin/gh/wconstab/458/orig 2025-12-04T09:43:32.7581037Z * [new branch] gh/wconstab/459/base -> origin/gh/wconstab/459/base 2025-12-04T09:43:32.7582825Z * [new branch] gh/wconstab/459/head -> origin/gh/wconstab/459/head 2025-12-04T09:43:32.7584522Z * [new branch] gh/wconstab/459/orig -> origin/gh/wconstab/459/orig 2025-12-04T09:43:32.7587489Z * [new branch] gh/wconstab/460/base -> origin/gh/wconstab/460/base 2025-12-04T09:43:32.7589471Z * [new branch] gh/wconstab/460/head -> origin/gh/wconstab/460/head 2025-12-04T09:43:32.7591224Z * [new branch] gh/wconstab/460/orig -> origin/gh/wconstab/460/orig 2025-12-04T09:43:32.7593714Z * [new branch] gh/wconstab/461/base -> origin/gh/wconstab/461/base 2025-12-04T09:43:32.7595410Z * [new branch] gh/wconstab/461/head -> origin/gh/wconstab/461/head 2025-12-04T09:43:32.7597198Z * [new branch] gh/wconstab/461/orig -> 
origin/gh/wconstab/461/orig 2025-12-04T09:43:32.7599428Z * [new branch] gh/wconstab/462/base -> origin/gh/wconstab/462/base 2025-12-04T09:43:32.7601345Z * [new branch] gh/wconstab/462/head -> origin/gh/wconstab/462/head 2025-12-04T09:43:32.7603085Z * [new branch] gh/wconstab/462/orig -> origin/gh/wconstab/462/orig 2025-12-04T09:43:32.7605559Z * [new branch] gh/wconstab/463/base -> origin/gh/wconstab/463/base 2025-12-04T09:43:32.7607315Z * [new branch] gh/wconstab/463/head -> origin/gh/wconstab/463/head 2025-12-04T09:43:32.7609015Z * [new branch] gh/wconstab/463/orig -> origin/gh/wconstab/463/orig 2025-12-04T09:43:32.7611449Z * [new branch] gh/wconstab/464/base -> origin/gh/wconstab/464/base 2025-12-04T09:43:32.7613419Z * [new branch] gh/wconstab/464/head -> origin/gh/wconstab/464/head 2025-12-04T09:43:32.7615074Z * [new branch] gh/wconstab/464/orig -> origin/gh/wconstab/464/orig 2025-12-04T09:43:32.7617362Z * [new branch] gh/wconstab/465/base -> origin/gh/wconstab/465/base 2025-12-04T09:43:32.7619125Z * [new branch] gh/wconstab/465/head -> origin/gh/wconstab/465/head 2025-12-04T09:43:32.7620843Z * [new branch] gh/wconstab/465/orig -> origin/gh/wconstab/465/orig 2025-12-04T09:43:32.7623299Z * [new branch] gh/wconstab/466/base -> origin/gh/wconstab/466/base 2025-12-04T09:43:32.7624948Z * [new branch] gh/wconstab/466/head -> origin/gh/wconstab/466/head 2025-12-04T09:43:32.7626553Z * [new branch] gh/wconstab/466/orig -> origin/gh/wconstab/466/orig 2025-12-04T09:43:32.7629505Z * [new branch] gh/wconstab/467/base -> origin/gh/wconstab/467/base 2025-12-04T09:43:32.7631199Z * [new branch] gh/wconstab/467/head -> origin/gh/wconstab/467/head 2025-12-04T09:43:32.7632949Z * [new branch] gh/wconstab/467/orig -> origin/gh/wconstab/467/orig 2025-12-04T09:43:32.7635231Z * [new branch] gh/wconstab/468/base -> origin/gh/wconstab/468/base 2025-12-04T09:43:32.7636918Z * [new branch] gh/wconstab/468/head -> origin/gh/wconstab/468/head 2025-12-04T09:43:32.7638593Z * [new branch] 
gh/wconstab/468/orig -> origin/gh/wconstab/468/orig 2025-12-04T09:43:32.7641544Z * [new branch] gh/weifengpy/39/base -> origin/gh/weifengpy/39/base 2025-12-04T09:43:32.7643289Z * [new branch] gh/weifengpy/39/head -> origin/gh/weifengpy/39/head 2025-12-04T09:43:32.7645038Z * [new branch] gh/weifengpy/39/orig -> origin/gh/weifengpy/39/orig 2025-12-04T09:43:32.7647923Z * [new branch] gh/weifengpy/40/base -> origin/gh/weifengpy/40/base 2025-12-04T09:43:32.7649687Z * [new branch] gh/weifengpy/40/head -> origin/gh/weifengpy/40/head 2025-12-04T09:43:32.7651393Z * [new branch] gh/weifengpy/40/orig -> origin/gh/weifengpy/40/orig 2025-12-04T09:43:32.7653776Z * [new branch] gh/weifengpy/41/base -> origin/gh/weifengpy/41/base 2025-12-04T09:43:32.7655729Z * [new branch] gh/weifengpy/41/head -> origin/gh/weifengpy/41/head 2025-12-04T09:43:32.7657874Z * [new branch] gh/weifengpy/41/orig -> origin/gh/weifengpy/41/orig 2025-12-04T09:43:32.7660724Z * [new branch] gh/williamwen42/250/base -> origin/gh/williamwen42/250/base 2025-12-04T09:43:32.7662507Z * [new branch] gh/williamwen42/250/head -> origin/gh/williamwen42/250/head 2025-12-04T09:43:32.7664206Z * [new branch] gh/williamwen42/250/orig -> origin/gh/williamwen42/250/orig 2025-12-04T09:43:32.7666717Z * [new branch] gh/williamwen42/279/base -> origin/gh/williamwen42/279/base 2025-12-04T09:43:32.7668765Z * [new branch] gh/williamwen42/279/head -> origin/gh/williamwen42/279/head 2025-12-04T09:43:32.7670523Z * [new branch] gh/williamwen42/279/orig -> origin/gh/williamwen42/279/orig 2025-12-04T09:43:32.7672860Z * [new branch] gh/williamwen42/282/base -> origin/gh/williamwen42/282/base 2025-12-04T09:43:32.7674532Z * [new branch] gh/williamwen42/282/head -> origin/gh/williamwen42/282/head 2025-12-04T09:43:32.7676238Z * [new branch] gh/williamwen42/282/orig -> origin/gh/williamwen42/282/orig 2025-12-04T09:43:32.7678636Z * [new branch] gh/williamwen42/287/base -> origin/gh/williamwen42/287/base 2025-12-04T09:43:32.7680460Z * [new branch] 
gh/williamwen42/287/head -> origin/gh/williamwen42/287/head 2025-12-04T09:43:32.7682085Z * [new branch] gh/williamwen42/287/orig -> origin/gh/williamwen42/287/orig 2025-12-04T09:43:32.7684559Z * [new branch] gh/williamwen42/288/base -> origin/gh/williamwen42/288/base 2025-12-04T09:43:32.7686176Z * [new branch] gh/williamwen42/288/head -> origin/gh/williamwen42/288/head 2025-12-04T09:43:32.7687847Z * [new branch] gh/williamwen42/288/orig -> origin/gh/williamwen42/288/orig 2025-12-04T09:43:32.7690430Z * [new branch] gh/williamwen42/296/base -> origin/gh/williamwen42/296/base 2025-12-04T09:43:32.7692753Z * [new branch] gh/williamwen42/296/head -> origin/gh/williamwen42/296/head 2025-12-04T09:43:32.7694590Z * [new branch] gh/williamwen42/296/orig -> origin/gh/williamwen42/296/orig 2025-12-04T09:43:32.7696844Z * [new branch] gh/williamwen42/297/base -> origin/gh/williamwen42/297/base 2025-12-04T09:43:32.7698569Z * [new branch] gh/williamwen42/297/head -> origin/gh/williamwen42/297/head 2025-12-04T09:43:32.7700432Z * [new branch] gh/williamwen42/297/orig -> origin/gh/williamwen42/297/orig 2025-12-04T09:43:32.7702740Z * [new branch] gh/williamwen42/306/base -> origin/gh/williamwen42/306/base 2025-12-04T09:43:32.7704619Z * [new branch] gh/williamwen42/306/head -> origin/gh/williamwen42/306/head 2025-12-04T09:43:32.7706292Z * [new branch] gh/williamwen42/306/orig -> origin/gh/williamwen42/306/orig 2025-12-04T09:43:32.7708826Z * [new branch] gh/williamwen42/309/base -> origin/gh/williamwen42/309/base 2025-12-04T09:43:32.7710612Z * [new branch] gh/williamwen42/309/head -> origin/gh/williamwen42/309/head 2025-12-04T09:43:32.7712318Z * [new branch] gh/williamwen42/309/orig -> origin/gh/williamwen42/309/orig 2025-12-04T09:43:32.7714712Z * [new branch] gh/williamwen42/310/base -> origin/gh/williamwen42/310/base 2025-12-04T09:43:32.7716419Z * [new branch] gh/williamwen42/310/head -> origin/gh/williamwen42/310/head 2025-12-04T09:43:32.7718265Z * [new branch] 
gh/williamwen42/310/orig -> origin/gh/williamwen42/310/orig 2025-12-04T09:43:32.7721500Z * [new branch] gh/williamwen42/311/base -> origin/gh/williamwen42/311/base 2025-12-04T09:43:32.7723204Z * [new branch] gh/williamwen42/311/head -> origin/gh/williamwen42/311/head 2025-12-04T09:43:32.7724961Z * [new branch] gh/williamwen42/311/orig -> origin/gh/williamwen42/311/orig 2025-12-04T09:43:32.7727171Z * [new branch] gh/williamwen42/319/base -> origin/gh/williamwen42/319/base 2025-12-04T09:43:32.7728898Z * [new branch] gh/williamwen42/319/head -> origin/gh/williamwen42/319/head 2025-12-04T09:43:32.7730625Z * [new branch] gh/williamwen42/319/orig -> origin/gh/williamwen42/319/orig 2025-12-04T09:43:32.7733010Z * [new branch] gh/williamwen42/325/base -> origin/gh/williamwen42/325/base 2025-12-04T09:43:32.7734829Z * [new branch] gh/williamwen42/325/head -> origin/gh/williamwen42/325/head 2025-12-04T09:43:32.7736539Z * [new branch] gh/williamwen42/325/orig -> origin/gh/williamwen42/325/orig 2025-12-04T09:43:32.7738939Z * [new branch] gh/williamwen42/326/base -> origin/gh/williamwen42/326/base 2025-12-04T09:43:32.7740758Z * [new branch] gh/williamwen42/326/head -> origin/gh/williamwen42/326/head 2025-12-04T09:43:32.7742625Z * [new branch] gh/williamwen42/326/orig -> origin/gh/williamwen42/326/orig 2025-12-04T09:43:32.7744963Z * [new branch] gh/williamwen42/327/base -> origin/gh/williamwen42/327/base 2025-12-04T09:43:32.7746666Z * [new branch] gh/williamwen42/327/head -> origin/gh/williamwen42/327/head 2025-12-04T09:43:32.7748513Z * [new branch] gh/williamwen42/327/orig -> origin/gh/williamwen42/327/orig 2025-12-04T09:43:32.7750967Z * [new branch] gh/williamwen42/328/base -> origin/gh/williamwen42/328/base 2025-12-04T09:43:32.7752884Z * [new branch] gh/williamwen42/328/head -> origin/gh/williamwen42/328/head 2025-12-04T09:43:32.7754474Z * [new branch] gh/williamwen42/328/orig -> origin/gh/williamwen42/328/orig 2025-12-04T09:43:32.7757599Z * [new branch] 
gh/williamwen42/329/base -> origin/gh/williamwen42/329/base 2025-12-04T09:43:32.7759360Z * [new branch] gh/williamwen42/329/head -> origin/gh/williamwen42/329/head 2025-12-04T09:43:32.7761054Z * [new branch] gh/williamwen42/329/orig -> origin/gh/williamwen42/329/orig 2025-12-04T09:43:32.7763533Z * [new branch] gh/williamwen42/330/base -> origin/gh/williamwen42/330/base 2025-12-04T09:43:32.7765261Z * [new branch] gh/williamwen42/330/head -> origin/gh/williamwen42/330/head 2025-12-04T09:43:32.7766986Z * [new branch] gh/williamwen42/330/orig -> origin/gh/williamwen42/330/orig 2025-12-04T09:43:32.7769413Z * [new branch] gh/williamwen42/331/base -> origin/gh/williamwen42/331/base 2025-12-04T09:43:32.7771636Z * [new branch] gh/williamwen42/331/head -> origin/gh/williamwen42/331/head 2025-12-04T09:43:32.7773300Z * [new branch] gh/williamwen42/331/orig -> origin/gh/williamwen42/331/orig 2025-12-04T09:43:32.7775998Z * [new branch] gh/williamwen42/332/base -> origin/gh/williamwen42/332/base 2025-12-04T09:43:32.7777662Z * [new branch] gh/williamwen42/332/head -> origin/gh/williamwen42/332/head 2025-12-04T09:43:32.7779352Z * [new branch] gh/williamwen42/332/orig -> origin/gh/williamwen42/332/orig 2025-12-04T09:43:32.7781910Z * [new branch] gh/williamwen42/333/base -> origin/gh/williamwen42/333/base 2025-12-04T09:43:32.7783685Z * [new branch] gh/williamwen42/333/head -> origin/gh/williamwen42/333/head 2025-12-04T09:43:32.7785363Z * [new branch] gh/williamwen42/333/orig -> origin/gh/williamwen42/333/orig 2025-12-04T09:43:32.7787901Z * [new branch] gh/williamwen42/334/base -> origin/gh/williamwen42/334/base 2025-12-04T09:43:32.7789600Z * [new branch] gh/williamwen42/334/head -> origin/gh/williamwen42/334/head 2025-12-04T09:43:32.7791409Z * [new branch] gh/williamwen42/334/orig -> origin/gh/williamwen42/334/orig 2025-12-04T09:43:32.7796822Z * [new branch] gh/williamwen42/335/base -> origin/gh/williamwen42/335/base 2025-12-04T09:43:32.7798579Z * [new branch] 
gh/williamwen42/335/head -> origin/gh/williamwen42/335/head 2025-12-04T09:43:32.7800246Z * [new branch] gh/williamwen42/335/orig -> origin/gh/williamwen42/335/orig 2025-12-04T09:43:32.7802679Z * [new branch] gh/williamwen42/336/base -> origin/gh/williamwen42/336/base 2025-12-04T09:43:32.7804324Z * [new branch] gh/williamwen42/336/head -> origin/gh/williamwen42/336/head 2025-12-04T09:43:32.7806001Z * [new branch] gh/williamwen42/336/orig -> origin/gh/williamwen42/336/orig 2025-12-04T09:43:32.7808385Z * [new branch] gh/williamwen42/337/base -> origin/gh/williamwen42/337/base 2025-12-04T09:43:32.7810081Z * [new branch] gh/williamwen42/337/head -> origin/gh/williamwen42/337/head 2025-12-04T09:43:32.7811791Z * [new branch] gh/williamwen42/337/orig -> origin/gh/williamwen42/337/orig 2025-12-04T09:43:32.7814304Z * [new branch] gh/williamwen42/338/base -> origin/gh/williamwen42/338/base 2025-12-04T09:43:32.7815988Z * [new branch] gh/williamwen42/338/head -> origin/gh/williamwen42/338/head 2025-12-04T09:43:32.7817670Z * [new branch] gh/williamwen42/338/orig -> origin/gh/williamwen42/338/orig 2025-12-04T09:43:32.7820039Z * [new branch] gh/williamwen42/339/base -> origin/gh/williamwen42/339/base 2025-12-04T09:43:32.7821937Z * [new branch] gh/williamwen42/339/head -> origin/gh/williamwen42/339/head 2025-12-04T09:43:32.7823619Z * [new branch] gh/williamwen42/339/orig -> origin/gh/williamwen42/339/orig 2025-12-04T09:43:32.7826034Z * [new branch] gh/williamwen42/340/base -> origin/gh/williamwen42/340/base 2025-12-04T09:43:32.7827743Z * [new branch] gh/williamwen42/340/head -> origin/gh/williamwen42/340/head 2025-12-04T09:43:32.7829384Z * [new branch] gh/williamwen42/340/orig -> origin/gh/williamwen42/340/orig 2025-12-04T09:43:32.7831851Z * [new branch] gh/williamwen42/341/base -> origin/gh/williamwen42/341/base 2025-12-04T09:43:32.7833591Z * [new branch] gh/williamwen42/341/head -> origin/gh/williamwen42/341/head 2025-12-04T09:43:32.7835313Z * [new branch] 
gh/williamwen42/341/orig -> origin/gh/williamwen42/341/orig 2025-12-04T09:43:32.7837793Z * [new branch] gh/williamwen42/342/base -> origin/gh/williamwen42/342/base 2025-12-04T09:43:32.7839537Z * [new branch] gh/williamwen42/342/head -> origin/gh/williamwen42/342/head 2025-12-04T09:43:32.7841294Z * [new branch] gh/williamwen42/342/orig -> origin/gh/williamwen42/342/orig 2025-12-04T09:43:32.7843689Z * [new branch] gh/williamwen42/343/base -> origin/gh/williamwen42/343/base 2025-12-04T09:43:32.7845415Z * [new branch] gh/williamwen42/343/head -> origin/gh/williamwen42/343/head 2025-12-04T09:43:32.7847104Z * [new branch] gh/williamwen42/343/orig -> origin/gh/williamwen42/343/orig 2025-12-04T09:43:32.7849492Z * [new branch] gh/williamwen42/344/base -> origin/gh/williamwen42/344/base 2025-12-04T09:43:32.7851209Z * [new branch] gh/williamwen42/344/head -> origin/gh/williamwen42/344/head 2025-12-04T09:43:32.7852923Z * [new branch] gh/williamwen42/344/orig -> origin/gh/williamwen42/344/orig 2025-12-04T09:43:32.7855509Z * [new branch] gh/williamwen42/345/base -> origin/gh/williamwen42/345/base 2025-12-04T09:43:32.7857282Z * [new branch] gh/williamwen42/345/head -> origin/gh/williamwen42/345/head 2025-12-04T09:43:32.7858958Z * [new branch] gh/williamwen42/345/orig -> origin/gh/williamwen42/345/orig 2025-12-04T09:43:32.7861413Z * [new branch] gh/williamwen42/346/base -> origin/gh/williamwen42/346/base 2025-12-04T09:43:32.7863186Z * [new branch] gh/williamwen42/346/head -> origin/gh/williamwen42/346/head 2025-12-04T09:43:32.7864907Z * [new branch] gh/williamwen42/346/orig -> origin/gh/williamwen42/346/orig 2025-12-04T09:43:32.7867428Z * [new branch] gh/williamwen42/347/base -> origin/gh/williamwen42/347/base 2025-12-04T09:43:32.7869158Z * [new branch] gh/williamwen42/347/head -> origin/gh/williamwen42/347/head 2025-12-04T09:43:32.7871359Z * [new branch] gh/williamwen42/347/orig -> origin/gh/williamwen42/347/orig 2025-12-04T09:43:32.7873619Z * [new branch] 
gh/williamwen42/348/base -> origin/gh/williamwen42/348/base 2025-12-04T09:43:32.7875250Z * [new branch] gh/williamwen42/348/head -> origin/gh/williamwen42/348/head 2025-12-04T09:43:32.7876942Z * [new branch] gh/williamwen42/348/orig -> origin/gh/williamwen42/348/orig 2025-12-04T09:43:32.7879621Z * [new branch] gh/williamwen42/349/base -> origin/gh/williamwen42/349/base 2025-12-04T09:43:32.7881395Z * [new branch] gh/williamwen42/349/head -> origin/gh/williamwen42/349/head 2025-12-04T09:43:32.7883114Z * [new branch] gh/williamwen42/349/orig -> origin/gh/williamwen42/349/orig 2025-12-04T09:43:32.7885644Z * [new branch] gh/williamwen42/350/base -> origin/gh/williamwen42/350/base 2025-12-04T09:43:32.7887357Z * [new branch] gh/williamwen42/350/head -> origin/gh/williamwen42/350/head 2025-12-04T09:43:32.7889202Z * [new branch] gh/williamwen42/350/orig -> origin/gh/williamwen42/350/orig 2025-12-04T09:43:32.7891408Z * [new branch] gh/williamwen42/351/base -> origin/gh/williamwen42/351/base 2025-12-04T09:43:32.7893210Z * [new branch] gh/williamwen42/351/head -> origin/gh/williamwen42/351/head 2025-12-04T09:43:32.7894900Z * [new branch] gh/williamwen42/351/orig -> origin/gh/williamwen42/351/orig 2025-12-04T09:43:32.7897289Z * [new branch] gh/williamwen42/352/base -> origin/gh/williamwen42/352/base 2025-12-04T09:43:32.7899030Z * [new branch] gh/williamwen42/352/head -> origin/gh/williamwen42/352/head 2025-12-04T09:43:32.7900686Z * [new branch] gh/williamwen42/352/orig -> origin/gh/williamwen42/352/orig 2025-12-04T09:43:32.7903169Z * [new branch] gh/williamwen42/353/base -> origin/gh/williamwen42/353/base 2025-12-04T09:43:32.7904877Z * [new branch] gh/williamwen42/353/head -> origin/gh/williamwen42/353/head 2025-12-04T09:43:32.7906611Z * [new branch] gh/williamwen42/353/orig -> origin/gh/williamwen42/353/orig 2025-12-04T09:43:32.7909145Z * [new branch] gh/williamwen42/354/base -> origin/gh/williamwen42/354/base 2025-12-04T09:43:32.7910942Z * [new branch] 
gh/williamwen42/354/head -> origin/gh/williamwen42/354/head 2025-12-04T09:43:32.7912652Z * [new branch] gh/williamwen42/354/orig -> origin/gh/williamwen42/354/orig 2025-12-04T09:43:32.7915019Z * [new branch] gh/williamwen42/355/base -> origin/gh/williamwen42/355/base 2025-12-04T09:43:32.7916668Z * [new branch] gh/williamwen42/355/head -> origin/gh/williamwen42/355/head 2025-12-04T09:43:32.7918363Z * [new branch] gh/williamwen42/355/orig -> origin/gh/williamwen42/355/orig 2025-12-04T09:43:32.7920716Z * [new branch] gh/williamwen42/356/base -> origin/gh/williamwen42/356/base 2025-12-04T09:43:32.7922430Z * [new branch] gh/williamwen42/356/head -> origin/gh/williamwen42/356/head 2025-12-04T09:43:32.7924110Z * [new branch] gh/williamwen42/356/orig -> origin/gh/williamwen42/356/orig 2025-12-04T09:43:32.7926475Z * [new branch] gh/williamwen42/357/base -> origin/gh/williamwen42/357/base 2025-12-04T09:43:32.7928196Z * [new branch] gh/williamwen42/357/head -> origin/gh/williamwen42/357/head 2025-12-04T09:43:32.7929880Z * [new branch] gh/williamwen42/357/orig -> origin/gh/williamwen42/357/orig 2025-12-04T09:43:32.7932356Z * [new branch] gh/williamwen42/358/base -> origin/gh/williamwen42/358/base 2025-12-04T09:43:32.7934055Z * [new branch] gh/williamwen42/358/head -> origin/gh/williamwen42/358/head 2025-12-04T09:43:32.7935863Z * [new branch] gh/williamwen42/358/orig -> origin/gh/williamwen42/358/orig 2025-12-04T09:43:32.7938637Z * [new branch] gh/xmfan/169/base -> origin/gh/xmfan/169/base 2025-12-04T09:43:32.7940402Z * [new branch] gh/xmfan/169/head -> origin/gh/xmfan/169/head 2025-12-04T09:43:32.7942695Z * [new branch] gh/xmfan/170/base -> origin/gh/xmfan/170/base 2025-12-04T09:43:32.7944332Z * [new branch] gh/xmfan/170/head -> origin/gh/xmfan/170/head 2025-12-04T09:43:32.7946612Z * [new branch] gh/xmfan/274/base -> origin/gh/xmfan/274/base 2025-12-04T09:43:32.7948633Z * [new branch] gh/xmfan/274/head -> origin/gh/xmfan/274/head 2025-12-04T09:43:32.7950382Z * [new branch] 
gh/xmfan/274/orig -> origin/gh/xmfan/274/orig 2025-12-04T09:43:32.7952679Z * [new branch] gh/xmfan/277/base -> origin/gh/xmfan/277/base 2025-12-04T09:43:32.7954429Z * [new branch] gh/xmfan/277/head -> origin/gh/xmfan/277/head 2025-12-04T09:43:32.7957948Z * [new branch] gh/xmfan/277/orig -> origin/gh/xmfan/277/orig 2025-12-04T09:43:32.7960385Z * [new branch] gh/xmfan/301/base -> origin/gh/xmfan/301/base 2025-12-04T09:43:32.7961851Z * [new branch] gh/xmfan/301/head -> origin/gh/xmfan/301/head 2025-12-04T09:43:32.7963509Z * [new branch] gh/xmfan/301/orig -> origin/gh/xmfan/301/orig 2025-12-04T09:43:32.7966232Z * [new branch] gh/xmfan/304/base -> origin/gh/xmfan/304/base 2025-12-04T09:43:32.7967957Z * [new branch] gh/xmfan/304/head -> origin/gh/xmfan/304/head 2025-12-04T09:43:32.7969626Z * [new branch] gh/xmfan/304/orig -> origin/gh/xmfan/304/orig 2025-12-04T09:43:32.7971878Z * [new branch] gh/xmfan/309/base -> origin/gh/xmfan/309/base 2025-12-04T09:43:32.7973592Z * [new branch] gh/xmfan/309/head -> origin/gh/xmfan/309/head 2025-12-04T09:43:32.7975235Z * [new branch] gh/xmfan/309/orig -> origin/gh/xmfan/309/orig 2025-12-04T09:43:32.7977526Z * [new branch] gh/xmfan/310/base -> origin/gh/xmfan/310/base 2025-12-04T09:43:32.7979281Z * [new branch] gh/xmfan/310/head -> origin/gh/xmfan/310/head 2025-12-04T09:43:32.7981003Z * [new branch] gh/xmfan/310/orig -> origin/gh/xmfan/310/orig 2025-12-04T09:43:32.7983279Z * [new branch] gh/xmfan/311/base -> origin/gh/xmfan/311/base 2025-12-04T09:43:32.7984993Z * [new branch] gh/xmfan/311/head -> origin/gh/xmfan/311/head 2025-12-04T09:43:32.7986651Z * [new branch] gh/xmfan/311/orig -> origin/gh/xmfan/311/orig 2025-12-04T09:43:32.7989059Z * [new branch] gh/xmfan/312/base -> origin/gh/xmfan/312/base 2025-12-04T09:43:32.7990700Z * [new branch] gh/xmfan/312/head -> origin/gh/xmfan/312/head 2025-12-04T09:43:32.7992369Z * [new branch] gh/xmfan/312/orig -> origin/gh/xmfan/312/orig 2025-12-04T09:43:32.7994660Z * [new branch] gh/xmfan/313/base 
-> origin/gh/xmfan/313/base 2025-12-04T09:43:32.7996341Z * [new branch] gh/xmfan/313/head -> origin/gh/xmfan/313/head 2025-12-04T09:43:32.7998019Z * [new branch] gh/xmfan/313/orig -> origin/gh/xmfan/313/orig 2025-12-04T09:43:32.8000893Z * [new branch] gh/xuanzhang816/27/base -> origin/gh/xuanzhang816/27/base 2025-12-04T09:43:32.8002578Z * [new branch] gh/xuanzhang816/27/head -> origin/gh/xuanzhang816/27/head 2025-12-04T09:43:32.8004302Z * [new branch] gh/xuanzhang816/27/orig -> origin/gh/xuanzhang816/27/orig 2025-12-04T09:43:32.8006663Z * [new branch] gh/xuanzhang816/32/base -> origin/gh/xuanzhang816/32/base 2025-12-04T09:43:32.8008404Z * [new branch] gh/xuanzhang816/32/head -> origin/gh/xuanzhang816/32/head 2025-12-04T09:43:32.8010165Z * [new branch] gh/xuanzhang816/32/orig -> origin/gh/xuanzhang816/32/orig 2025-12-04T09:43:32.8012553Z * [new branch] gh/xuanzhang816/33/base -> origin/gh/xuanzhang816/33/base 2025-12-04T09:43:32.8014211Z * [new branch] gh/xuanzhang816/33/head -> origin/gh/xuanzhang816/33/head 2025-12-04T09:43:32.8015945Z * [new branch] gh/xuanzhang816/33/orig -> origin/gh/xuanzhang816/33/orig 2025-12-04T09:43:32.8018528Z * [new branch] gh/xuanzhang816/34/base -> origin/gh/xuanzhang816/34/base 2025-12-04T09:43:32.8020198Z * [new branch] gh/xuanzhang816/34/head -> origin/gh/xuanzhang816/34/head 2025-12-04T09:43:32.8021909Z * [new branch] gh/xuanzhang816/34/orig -> origin/gh/xuanzhang816/34/orig 2025-12-04T09:43:32.8024454Z * [new branch] gh/xuanzhang816/35/base -> origin/gh/xuanzhang816/35/base 2025-12-04T09:43:32.8026169Z * [new branch] gh/xuanzhang816/35/head -> origin/gh/xuanzhang816/35/head 2025-12-04T09:43:32.8028025Z * [new branch] gh/xuanzhang816/35/orig -> origin/gh/xuanzhang816/35/orig 2025-12-04T09:43:32.8030782Z * [new branch] gh/yanbing-j/11/base -> origin/gh/yanbing-j/11/base 2025-12-04T09:43:32.8032503Z * [new branch] gh/yanbing-j/11/head -> origin/gh/yanbing-j/11/head 2025-12-04T09:43:32.8034244Z * [new branch] gh/yanbing-j/11/orig -> 
origin/gh/yanbing-j/11/orig 2025-12-04T09:43:32.8036504Z * [new branch] gh/yanbing-j/12/base -> origin/gh/yanbing-j/12/base 2025-12-04T09:43:32.8038205Z * [new branch] gh/yanbing-j/12/head -> origin/gh/yanbing-j/12/head 2025-12-04T09:43:32.8039919Z * [new branch] gh/yanbing-j/12/orig -> origin/gh/yanbing-j/12/orig 2025-12-04T09:43:32.8042705Z * [new branch] gh/yanbing-j/13/base -> origin/gh/yanbing-j/13/base 2025-12-04T09:43:32.8044400Z * [new branch] gh/yanbing-j/13/head -> origin/gh/yanbing-j/13/head 2025-12-04T09:43:32.8046160Z * [new branch] gh/yanbing-j/13/orig -> origin/gh/yanbing-j/13/orig 2025-12-04T09:43:32.8048499Z * [new branch] gh/yanbing-j/14/base -> origin/gh/yanbing-j/14/base 2025-12-04T09:43:32.8050482Z * [new branch] gh/yanbing-j/14/head -> origin/gh/yanbing-j/14/head 2025-12-04T09:43:32.8052271Z * [new branch] gh/yanbing-j/14/orig -> origin/gh/yanbing-j/14/orig 2025-12-04T09:43:32.8054476Z * [new branch] gh/yanbing-j/15/base -> origin/gh/yanbing-j/15/base 2025-12-04T09:43:32.8056394Z * [new branch] gh/yanbing-j/15/head -> origin/gh/yanbing-j/15/head 2025-12-04T09:43:32.8058062Z * [new branch] gh/yanbing-j/15/orig -> origin/gh/yanbing-j/15/orig 2025-12-04T09:43:32.8060365Z * [new branch] gh/yanbing-j/18/base -> origin/gh/yanbing-j/18/base 2025-12-04T09:43:32.8061995Z * [new branch] gh/yanbing-j/18/head -> origin/gh/yanbing-j/18/head 2025-12-04T09:43:32.8063736Z * [new branch] gh/yanbing-j/18/orig -> origin/gh/yanbing-j/18/orig 2025-12-04T09:43:32.8066477Z * [new branch] gh/yanbing-j/19/base -> origin/gh/yanbing-j/19/base 2025-12-04T09:43:32.8068303Z * [new branch] gh/yanbing-j/19/head -> origin/gh/yanbing-j/19/head 2025-12-04T09:43:32.8069963Z * [new branch] gh/yanbing-j/19/orig -> origin/gh/yanbing-j/19/orig 2025-12-04T09:43:32.8072363Z * [new branch] gh/yanbing-j/20/base -> origin/gh/yanbing-j/20/base 2025-12-04T09:43:32.8074033Z * [new branch] gh/yanbing-j/20/head -> origin/gh/yanbing-j/20/head 2025-12-04T09:43:32.8075768Z * [new branch] 
gh/yanbing-j/20/orig -> origin/gh/yanbing-j/20/orig 2025-12-04T09:43:32.8078098Z * [new branch] gh/yanbing-j/21/base -> origin/gh/yanbing-j/21/base 2025-12-04T09:43:32.8079833Z * [new branch] gh/yanbing-j/21/head -> origin/gh/yanbing-j/21/head 2025-12-04T09:43:32.8082095Z * [new branch] gh/yanbing-j/22/base -> origin/gh/yanbing-j/22/base 2025-12-04T09:43:32.8083793Z * [new branch] gh/yanbing-j/22/head -> origin/gh/yanbing-j/22/head 2025-12-04T09:43:32.8085540Z * [new branch] gh/yanbing-j/22/orig -> origin/gh/yanbing-j/22/orig 2025-12-04T09:43:32.8087785Z * [new branch] gh/yanbing-j/23/base -> origin/gh/yanbing-j/23/base 2025-12-04T09:43:32.8089538Z * [new branch] gh/yanbing-j/23/head -> origin/gh/yanbing-j/23/head 2025-12-04T09:43:32.8091211Z * [new branch] gh/yanbing-j/23/orig -> origin/gh/yanbing-j/23/orig 2025-12-04T09:43:32.8093520Z * [new branch] gh/yanbing-j/24/base -> origin/gh/yanbing-j/24/base 2025-12-04T09:43:32.8095193Z * [new branch] gh/yanbing-j/24/head -> origin/gh/yanbing-j/24/head 2025-12-04T09:43:32.8097030Z * [new branch] gh/yanbing-j/24/orig -> origin/gh/yanbing-j/24/orig 2025-12-04T09:43:32.8099192Z * [new branch] gh/yanbing-j/25/base -> origin/gh/yanbing-j/25/base 2025-12-04T09:43:32.8100869Z * [new branch] gh/yanbing-j/25/head -> origin/gh/yanbing-j/25/head 2025-12-04T09:43:32.8102538Z * [new branch] gh/yanbing-j/25/orig -> origin/gh/yanbing-j/25/orig 2025-12-04T09:43:32.8104840Z * [new branch] gh/yanbing-j/26/base -> origin/gh/yanbing-j/26/base 2025-12-04T09:43:32.8107052Z * [new branch] gh/yanbing-j/26/head -> origin/gh/yanbing-j/26/head 2025-12-04T09:43:32.8108895Z * [new branch] gh/yanbing-j/26/orig -> origin/gh/yanbing-j/26/orig 2025-12-04T09:43:32.8112239Z * [new branch] gh/yang-yu-hang/1/base -> origin/gh/yang-yu-hang/1/base 2025-12-04T09:43:32.8114086Z * [new branch] gh/yang-yu-hang/1/head -> origin/gh/yang-yu-hang/1/head 2025-12-04T09:43:32.8115960Z * [new branch] gh/yang-yu-hang/1/orig -> origin/gh/yang-yu-hang/1/orig 
2025-12-04T09:43:32.8118292Z * [new branch] gh/yang-yu-hang/2/base -> origin/gh/yang-yu-hang/2/base 2025-12-04T09:43:32.8120398Z * [new branch] gh/yang-yu-hang/2/head -> origin/gh/yang-yu-hang/2/head 2025-12-04T09:43:32.8122448Z * [new branch] gh/yang-yu-hang/2/orig -> origin/gh/yang-yu-hang/2/orig 2025-12-04T09:43:32.8124727Z * [new branch] gh/yang-yu-hang/3/base -> origin/gh/yang-yu-hang/3/base 2025-12-04T09:43:32.8126914Z * [new branch] gh/yang-yu-hang/3/head -> origin/gh/yang-yu-hang/3/head 2025-12-04T09:43:32.8128711Z * [new branch] gh/yang-yu-hang/3/orig -> origin/gh/yang-yu-hang/3/orig 2025-12-04T09:43:32.8131810Z * [new branch] gh/yangw-dev/12/base -> origin/gh/yangw-dev/12/base 2025-12-04T09:43:32.8133485Z * [new branch] gh/yangw-dev/12/head -> origin/gh/yangw-dev/12/head 2025-12-04T09:43:32.8135190Z * [new branch] gh/yangw-dev/12/orig -> origin/gh/yangw-dev/12/orig 2025-12-04T09:43:32.8137956Z * [new branch] gh/yangw-dev/13/base -> origin/gh/yangw-dev/13/base 2025-12-04T09:43:32.8139734Z * [new branch] gh/yangw-dev/13/head -> origin/gh/yangw-dev/13/head 2025-12-04T09:43:32.8141442Z * [new branch] gh/yangw-dev/13/orig -> origin/gh/yangw-dev/13/orig 2025-12-04T09:43:32.8143781Z * [new branch] gh/yangw-dev/14/base -> origin/gh/yangw-dev/14/base 2025-12-04T09:43:32.8145517Z * [new branch] gh/yangw-dev/14/head -> origin/gh/yangw-dev/14/head 2025-12-04T09:43:32.8147183Z * [new branch] gh/yangw-dev/14/orig -> origin/gh/yangw-dev/14/orig 2025-12-04T09:43:32.8150015Z * [new branch] gh/yangw-dev/15/base -> origin/gh/yangw-dev/15/base 2025-12-04T09:43:32.8151667Z * [new branch] gh/yangw-dev/15/head -> origin/gh/yangw-dev/15/head 2025-12-04T09:43:32.8153376Z * [new branch] gh/yangw-dev/15/orig -> origin/gh/yangw-dev/15/orig 2025-12-04T09:43:32.8155821Z * [new branch] gh/yangw-dev/19/base -> origin/gh/yangw-dev/19/base 2025-12-04T09:43:32.8157728Z * [new branch] gh/yangw-dev/19/head -> origin/gh/yangw-dev/19/head 2025-12-04T09:43:32.8159473Z * [new branch] 
gh/yangw-dev/19/orig -> origin/gh/yangw-dev/19/orig 2025-12-04T09:43:32.8161866Z * [new branch] gh/yangw-dev/26/base -> origin/gh/yangw-dev/26/base 2025-12-04T09:43:32.8163952Z * [new branch] gh/yangw-dev/26/head -> origin/gh/yangw-dev/26/head 2025-12-04T09:43:32.8165716Z * [new branch] gh/yangw-dev/26/orig -> origin/gh/yangw-dev/26/orig 2025-12-04T09:43:32.8167986Z * [new branch] gh/yangw-dev/27/base -> origin/gh/yangw-dev/27/base 2025-12-04T09:43:32.8169853Z * [new branch] gh/yangw-dev/27/head -> origin/gh/yangw-dev/27/head 2025-12-04T09:43:32.8171421Z * [new branch] gh/yangw-dev/27/orig -> origin/gh/yangw-dev/27/orig 2025-12-04T09:43:32.8174218Z * [new branch] gh/ydwu4/292/base -> origin/gh/ydwu4/292/base 2025-12-04T09:43:32.8175925Z * [new branch] gh/ydwu4/292/head -> origin/gh/ydwu4/292/head 2025-12-04T09:43:32.8177688Z * [new branch] gh/ydwu4/292/orig -> origin/gh/ydwu4/292/orig 2025-12-04T09:43:32.8180042Z * [new branch] gh/ydwu4/294/base -> origin/gh/ydwu4/294/base 2025-12-04T09:43:32.8181746Z * [new branch] gh/ydwu4/294/head -> origin/gh/ydwu4/294/head 2025-12-04T09:43:32.8183387Z * [new branch] gh/ydwu4/294/orig -> origin/gh/ydwu4/294/orig 2025-12-04T09:43:32.8185834Z * [new branch] gh/ydwu4/295/base -> origin/gh/ydwu4/295/base 2025-12-04T09:43:32.8187715Z * [new branch] gh/ydwu4/295/head -> origin/gh/ydwu4/295/head 2025-12-04T09:43:32.8189464Z * [new branch] gh/ydwu4/295/orig -> origin/gh/ydwu4/295/orig 2025-12-04T09:43:32.8191695Z * [new branch] gh/ydwu4/296/base -> origin/gh/ydwu4/296/base 2025-12-04T09:43:32.8193314Z * [new branch] gh/ydwu4/296/head -> origin/gh/ydwu4/296/head 2025-12-04T09:43:32.8195036Z * [new branch] gh/ydwu4/296/orig -> origin/gh/ydwu4/296/orig 2025-12-04T09:43:32.8197372Z * [new branch] gh/ydwu4/306/base -> origin/gh/ydwu4/306/base 2025-12-04T09:43:32.8199212Z * [new branch] gh/ydwu4/306/head -> origin/gh/ydwu4/306/head 2025-12-04T09:43:32.8200979Z * [new branch] gh/ydwu4/306/orig -> origin/gh/ydwu4/306/orig 
2025-12-04T09:43:32.8203282Z * [new branch] gh/ydwu4/312/base -> origin/gh/ydwu4/312/base 2025-12-04T09:43:32.8205004Z * [new branch] gh/ydwu4/312/head -> origin/gh/ydwu4/312/head 2025-12-04T09:43:32.8206697Z * [new branch] gh/ydwu4/312/orig -> origin/gh/ydwu4/312/orig 2025-12-04T09:43:32.8208946Z * [new branch] gh/ydwu4/322/base -> origin/gh/ydwu4/322/base 2025-12-04T09:43:32.8210749Z * [new branch] gh/ydwu4/322/head -> origin/gh/ydwu4/322/head 2025-12-04T09:43:32.8212464Z * [new branch] gh/ydwu4/322/orig -> origin/gh/ydwu4/322/orig 2025-12-04T09:43:32.8214728Z * [new branch] gh/ydwu4/327/base -> origin/gh/ydwu4/327/base 2025-12-04T09:43:32.8216450Z * [new branch] gh/ydwu4/327/head -> origin/gh/ydwu4/327/head 2025-12-04T09:43:32.8218138Z * [new branch] gh/ydwu4/327/orig -> origin/gh/ydwu4/327/orig 2025-12-04T09:43:32.8220561Z * [new branch] gh/ydwu4/328/base -> origin/gh/ydwu4/328/base 2025-12-04T09:43:32.8222212Z * [new branch] gh/ydwu4/328/head -> origin/gh/ydwu4/328/head 2025-12-04T09:43:32.8223865Z * [new branch] gh/ydwu4/328/orig -> origin/gh/ydwu4/328/orig 2025-12-04T09:43:32.8226038Z * [new branch] gh/ydwu4/329/base -> origin/gh/ydwu4/329/base 2025-12-04T09:43:32.8227788Z * [new branch] gh/ydwu4/329/head -> origin/gh/ydwu4/329/head 2025-12-04T09:43:32.8229503Z * [new branch] gh/ydwu4/329/orig -> origin/gh/ydwu4/329/orig 2025-12-04T09:43:32.8231963Z * [new branch] gh/ydwu4/330/base -> origin/gh/ydwu4/330/base 2025-12-04T09:43:32.8233603Z * [new branch] gh/ydwu4/330/head -> origin/gh/ydwu4/330/head 2025-12-04T09:43:32.8235371Z * [new branch] gh/ydwu4/330/orig -> origin/gh/ydwu4/330/orig 2025-12-04T09:43:32.8238014Z * [new branch] gh/ydwu4/331/base -> origin/gh/ydwu4/331/base 2025-12-04T09:43:32.8239849Z * [new branch] gh/ydwu4/331/head -> origin/gh/ydwu4/331/head 2025-12-04T09:43:32.8241489Z * [new branch] gh/ydwu4/331/orig -> origin/gh/ydwu4/331/orig 2025-12-04T09:43:32.8243680Z * [new branch] gh/ydwu4/332/base -> origin/gh/ydwu4/332/base 
2025-12-04T09:43:32.8245360Z * [new branch] gh/ydwu4/332/head -> origin/gh/ydwu4/332/head 2025-12-04T09:43:32.8247126Z * [new branch] gh/ydwu4/332/orig -> origin/gh/ydwu4/332/orig 2025-12-04T09:43:32.8249286Z * [new branch] gh/ydwu4/333/base -> origin/gh/ydwu4/333/base 2025-12-04T09:43:32.8251035Z * [new branch] gh/ydwu4/333/head -> origin/gh/ydwu4/333/head 2025-12-04T09:43:32.8252736Z * [new branch] gh/ydwu4/333/orig -> origin/gh/ydwu4/333/orig 2025-12-04T09:43:32.8254880Z * [new branch] gh/ydwu4/334/base -> origin/gh/ydwu4/334/base 2025-12-04T09:43:32.8256926Z * [new branch] gh/ydwu4/334/head -> origin/gh/ydwu4/334/head 2025-12-04T09:43:32.8258593Z * [new branch] gh/ydwu4/334/orig -> origin/gh/ydwu4/334/orig 2025-12-04T09:43:32.8260805Z * [new branch] gh/ydwu4/335/base -> origin/gh/ydwu4/335/base 2025-12-04T09:43:32.8262439Z * [new branch] gh/ydwu4/335/head -> origin/gh/ydwu4/335/head 2025-12-04T09:43:32.8264132Z * [new branch] gh/ydwu4/335/orig -> origin/gh/ydwu4/335/orig 2025-12-04T09:43:32.8266959Z * [new branch] gh/ydwu4/337/base -> origin/gh/ydwu4/337/base 2025-12-04T09:43:32.8268829Z * [new branch] gh/ydwu4/337/head -> origin/gh/ydwu4/337/head 2025-12-04T09:43:32.8270457Z * [new branch] gh/ydwu4/337/orig -> origin/gh/ydwu4/337/orig 2025-12-04T09:43:32.8272810Z * [new branch] gh/ydwu4/339/base -> origin/gh/ydwu4/339/base 2025-12-04T09:43:32.8274505Z * [new branch] gh/ydwu4/339/head -> origin/gh/ydwu4/339/head 2025-12-04T09:43:32.8276178Z * [new branch] gh/ydwu4/339/orig -> origin/gh/ydwu4/339/orig 2025-12-04T09:43:32.8279104Z * [new branch] gh/yf225/133/base -> origin/gh/yf225/133/base 2025-12-04T09:43:32.8280741Z * [new branch] gh/yf225/133/head -> origin/gh/yf225/133/head 2025-12-04T09:43:32.8283065Z * [new branch] gh/yf225/93/base -> origin/gh/yf225/93/base 2025-12-04T09:43:32.8284856Z * [new branch] gh/yf225/93/head -> origin/gh/yf225/93/head 2025-12-04T09:43:32.8287966Z * [new branch] gh/yifuwang/152/base -> origin/gh/yifuwang/152/base 
2025-12-04T09:43:32.8290032Z * [new branch] gh/yifuwang/152/head -> origin/gh/yifuwang/152/head 2025-12-04T09:43:32.8291839Z * [new branch] gh/yifuwang/152/orig -> origin/gh/yifuwang/152/orig 2025-12-04T09:43:32.8294133Z * [new branch] gh/yifuwang/195/base -> origin/gh/yifuwang/195/base 2025-12-04T09:43:32.8295874Z * [new branch] gh/yifuwang/195/head -> origin/gh/yifuwang/195/head 2025-12-04T09:43:32.8297591Z * [new branch] gh/yifuwang/195/orig -> origin/gh/yifuwang/195/orig 2025-12-04T09:43:32.8300521Z * [new branch] gh/yiming0416/1/base -> origin/gh/yiming0416/1/base 2025-12-04T09:43:32.8302230Z * [new branch] gh/yiming0416/1/head -> origin/gh/yiming0416/1/head 2025-12-04T09:43:32.8304429Z * [new branch] gh/yiming0416/2/base -> origin/gh/yiming0416/2/base 2025-12-04T09:43:32.8306022Z * [new branch] gh/yiming0416/2/head -> origin/gh/yiming0416/2/head 2025-12-04T09:43:32.8309383Z * [new branch] gh/yushangdi/1/base -> origin/gh/yushangdi/1/base 2025-12-04T09:43:32.8311321Z * [new branch] gh/yushangdi/1/head -> origin/gh/yushangdi/1/head 2025-12-04T09:43:32.8313529Z * [new branch] gh/yushangdi/10/base -> origin/gh/yushangdi/10/base 2025-12-04T09:43:32.8315200Z * [new branch] gh/yushangdi/10/head -> origin/gh/yushangdi/10/head 2025-12-04T09:43:32.8316876Z * [new branch] gh/yushangdi/10/orig -> origin/gh/yushangdi/10/orig 2025-12-04T09:43:32.8319187Z * [new branch] gh/yushangdi/11/base -> origin/gh/yushangdi/11/base 2025-12-04T09:43:32.8320927Z * [new branch] gh/yushangdi/11/head -> origin/gh/yushangdi/11/head 2025-12-04T09:43:32.8322688Z * [new branch] gh/yushangdi/11/orig -> origin/gh/yushangdi/11/orig 2025-12-04T09:43:32.8324899Z * [new branch] gh/yushangdi/2/base -> origin/gh/yushangdi/2/base 2025-12-04T09:43:32.8326649Z * [new branch] gh/yushangdi/2/head -> origin/gh/yushangdi/2/head 2025-12-04T09:43:32.8328981Z * [new branch] gh/yushangdi/7/base -> origin/gh/yushangdi/7/base 2025-12-04T09:43:32.8330662Z * [new branch] gh/yushangdi/7/head -> 
origin/gh/yushangdi/7/head 2025-12-04T09:43:32.8332375Z * [new branch] gh/yushangdi/7/orig -> origin/gh/yushangdi/7/orig 2025-12-04T09:43:32.8334889Z * [new branch] gh/yushangdi/8/base -> origin/gh/yushangdi/8/base 2025-12-04T09:43:32.8336755Z * [new branch] gh/yushangdi/8/head -> origin/gh/yushangdi/8/head 2025-12-04T09:43:32.8338481Z * [new branch] gh/yushangdi/8/orig -> origin/gh/yushangdi/8/orig 2025-12-04T09:43:32.8340725Z * [new branch] gh/yushangdi/9/base -> origin/gh/yushangdi/9/base 2025-12-04T09:43:32.8342491Z * [new branch] gh/yushangdi/9/head -> origin/gh/yushangdi/9/head 2025-12-04T09:43:32.8344244Z * [new branch] gh/yushangdi/9/orig -> origin/gh/yushangdi/9/orig 2025-12-04T09:43:32.8347063Z * [new branch] gh/zklaus/19/base -> origin/gh/zklaus/19/base 2025-12-04T09:43:32.8349005Z * [new branch] gh/zklaus/19/head -> origin/gh/zklaus/19/head 2025-12-04T09:43:32.8350684Z * [new branch] gh/zklaus/19/orig -> origin/gh/zklaus/19/orig 2025-12-04T09:43:32.8353073Z * [new branch] gh/zklaus/20/base -> origin/gh/zklaus/20/base 2025-12-04T09:43:32.8354759Z * [new branch] gh/zklaus/20/head -> origin/gh/zklaus/20/head 2025-12-04T09:43:32.8356688Z * [new branch] gh/zklaus/20/orig -> origin/gh/zklaus/20/orig 2025-12-04T09:43:32.8359026Z * [new branch] gh/zklaus/21/base -> origin/gh/zklaus/21/base 2025-12-04T09:43:32.8360870Z * [new branch] gh/zklaus/21/head -> origin/gh/zklaus/21/head 2025-12-04T09:43:32.8362560Z * [new branch] gh/zklaus/21/orig -> origin/gh/zklaus/21/orig 2025-12-04T09:43:32.8364829Z * [new branch] gh/zklaus/22/base -> origin/gh/zklaus/22/base 2025-12-04T09:43:32.8366518Z * [new branch] gh/zklaus/22/head -> origin/gh/zklaus/22/head 2025-12-04T09:43:32.8368280Z * [new branch] gh/zklaus/22/orig -> origin/gh/zklaus/22/orig 2025-12-04T09:43:32.8370544Z * [new branch] gh/zklaus/23/base -> origin/gh/zklaus/23/base 2025-12-04T09:43:32.8372260Z * [new branch] gh/zklaus/23/head -> origin/gh/zklaus/23/head 2025-12-04T09:43:32.8373979Z * [new branch] 
gh/zklaus/23/orig -> origin/gh/zklaus/23/orig 2025-12-04T09:43:32.8376171Z * [new branch] gh/zklaus/24/base -> origin/gh/zklaus/24/base 2025-12-04T09:43:32.8377870Z * [new branch] gh/zklaus/24/head -> origin/gh/zklaus/24/head 2025-12-04T09:43:32.8379601Z * [new branch] gh/zklaus/24/orig -> origin/gh/zklaus/24/orig 2025-12-04T09:43:32.8382636Z * [new branch] gh/zou3519/1197/base -> origin/gh/zou3519/1197/base 2025-12-04T09:43:32.8384186Z * [new branch] gh/zou3519/1197/head -> origin/gh/zou3519/1197/head 2025-12-04T09:43:32.8385843Z * [new branch] gh/zou3519/1197/orig -> origin/gh/zou3519/1197/orig 2025-12-04T09:43:32.8388672Z * [new branch] gh/zou3519/1199/base -> origin/gh/zou3519/1199/base 2025-12-04T09:43:32.8390503Z * [new branch] gh/zou3519/1199/head -> origin/gh/zou3519/1199/head 2025-12-04T09:43:32.8392240Z * [new branch] gh/zou3519/1199/orig -> origin/gh/zou3519/1199/orig 2025-12-04T09:43:32.8394547Z * [new branch] gh/zou3519/1200/base -> origin/gh/zou3519/1200/base 2025-12-04T09:43:32.8396275Z * [new branch] gh/zou3519/1200/head -> origin/gh/zou3519/1200/head 2025-12-04T09:43:32.8398516Z * [new branch] gh/zou3519/1200/orig -> origin/gh/zou3519/1200/orig 2025-12-04T09:43:32.8400888Z * [new branch] gh/zou3519/1201/base -> origin/gh/zou3519/1201/base 2025-12-04T09:43:32.8402514Z * [new branch] gh/zou3519/1201/head -> origin/gh/zou3519/1201/head 2025-12-04T09:43:32.8404229Z * [new branch] gh/zou3519/1201/orig -> origin/gh/zou3519/1201/orig 2025-12-04T09:43:32.8406342Z * [new branch] gh/zou3519/1202/base -> origin/gh/zou3519/1202/base 2025-12-04T09:43:32.8408094Z * [new branch] gh/zou3519/1202/head -> origin/gh/zou3519/1202/head 2025-12-04T09:43:32.8409869Z * [new branch] gh/zou3519/1202/orig -> origin/gh/zou3519/1202/orig 2025-12-04T09:43:32.8412939Z * [new branch] gh/zpcore/1/base -> origin/gh/zpcore/1/base 2025-12-04T09:43:32.8414609Z * [new branch] gh/zpcore/1/head -> origin/gh/zpcore/1/head 2025-12-04T09:43:32.8416993Z * [new branch] gh/zpcore/11/base -> 
origin/gh/zpcore/11/base 2025-12-04T09:43:32.8418757Z * [new branch] gh/zpcore/11/head -> origin/gh/zpcore/11/head 2025-12-04T09:43:32.8420853Z * [new branch] gh/zpcore/11/orig -> origin/gh/zpcore/11/orig 2025-12-04T09:43:32.8423441Z * [new branch] gh/zpcore/12/base -> origin/gh/zpcore/12/base 2025-12-04T09:43:32.8425171Z * [new branch] gh/zpcore/12/head -> origin/gh/zpcore/12/head 2025-12-04T09:43:32.8426951Z * [new branch] gh/zpcore/12/orig -> origin/gh/zpcore/12/orig 2025-12-04T09:43:32.8429496Z * [new branch] gh/zpcore/13/base -> origin/gh/zpcore/13/base 2025-12-04T09:43:32.8431107Z * [new branch] gh/zpcore/13/head -> origin/gh/zpcore/13/head 2025-12-04T09:43:32.8432792Z * [new branch] gh/zpcore/13/orig -> origin/gh/zpcore/13/orig 2025-12-04T09:43:32.8435094Z * [new branch] gh/zpcore/14/base -> origin/gh/zpcore/14/base 2025-12-04T09:43:32.8436944Z * [new branch] gh/zpcore/14/head -> origin/gh/zpcore/14/head 2025-12-04T09:43:32.8438619Z * [new branch] gh/zpcore/14/orig -> origin/gh/zpcore/14/orig 2025-12-04T09:43:32.8441135Z * [new branch] gh/zpcore/15/base -> origin/gh/zpcore/15/base 2025-12-04T09:43:32.8442779Z * [new branch] gh/zpcore/15/head -> origin/gh/zpcore/15/head 2025-12-04T09:43:32.8444548Z * [new branch] gh/zpcore/15/orig -> origin/gh/zpcore/15/orig 2025-12-04T09:43:32.8446867Z * [new branch] gh/zpcore/2/base -> origin/gh/zpcore/2/base 2025-12-04T09:43:32.8448606Z * [new branch] gh/zpcore/2/head -> origin/gh/zpcore/2/head 2025-12-04T09:43:32.8451398Z * [new branch] gh/zpcore/21/base -> origin/gh/zpcore/21/base 2025-12-04T09:43:32.8453268Z * [new branch] gh/zpcore/21/head -> origin/gh/zpcore/21/head 2025-12-04T09:43:32.8454977Z * [new branch] gh/zpcore/21/orig -> origin/gh/zpcore/21/orig 2025-12-04T09:43:32.8459361Z * [new branch] gh/zpcore/22/base -> origin/gh/zpcore/22/base 2025-12-04T09:43:32.8461129Z * [new branch] gh/zpcore/22/head -> origin/gh/zpcore/22/head 2025-12-04T09:43:32.8462964Z * [new branch] gh/zpcore/22/orig -> 
origin/gh/zpcore/22/orig 2025-12-04T09:43:32.8465347Z * [new branch] gh/zpcore/23/base -> origin/gh/zpcore/23/base 2025-12-04T09:43:32.8467089Z * [new branch] gh/zpcore/23/head -> origin/gh/zpcore/23/head 2025-12-04T09:43:32.8468956Z * [new branch] gh/zpcore/23/orig -> origin/gh/zpcore/23/orig 2025-12-04T09:43:32.8471030Z * [new branch] gh/zpcore/24/base -> origin/gh/zpcore/24/base 2025-12-04T09:43:32.8472758Z * [new branch] gh/zpcore/24/head -> origin/gh/zpcore/24/head 2025-12-04T09:43:32.8474482Z * [new branch] gh/zpcore/24/orig -> origin/gh/zpcore/24/orig 2025-12-04T09:43:32.8476928Z * [new branch] gh/zpcore/25/base -> origin/gh/zpcore/25/base 2025-12-04T09:43:32.8478575Z * [new branch] gh/zpcore/25/head -> origin/gh/zpcore/25/head 2025-12-04T09:43:32.8480308Z * [new branch] gh/zpcore/25/orig -> origin/gh/zpcore/25/orig 2025-12-04T09:43:32.8482726Z * [new branch] gh/zpcore/26/base -> origin/gh/zpcore/26/base 2025-12-04T09:43:32.8484542Z * [new branch] gh/zpcore/26/head -> origin/gh/zpcore/26/head 2025-12-04T09:43:32.8486326Z * [new branch] gh/zpcore/26/orig -> origin/gh/zpcore/26/orig 2025-12-04T09:43:32.8488679Z * [new branch] gh/zpcore/27/base -> origin/gh/zpcore/27/base 2025-12-04T09:43:32.8490391Z * [new branch] gh/zpcore/27/head -> origin/gh/zpcore/27/head 2025-12-04T09:43:32.8492104Z * [new branch] gh/zpcore/27/orig -> origin/gh/zpcore/27/orig 2025-12-04T09:43:32.8494849Z * [new branch] gh/zpcore/28/base -> origin/gh/zpcore/28/base 2025-12-04T09:43:32.8496873Z * [new branch] gh/zpcore/28/head -> origin/gh/zpcore/28/head 2025-12-04T09:43:32.8498617Z * [new branch] gh/zpcore/28/orig -> origin/gh/zpcore/28/orig 2025-12-04T09:43:32.8500856Z * [new branch] gh/zpcore/3/base -> origin/gh/zpcore/3/base 2025-12-04T09:43:32.8502554Z * [new branch] gh/zpcore/3/head -> origin/gh/zpcore/3/head 2025-12-04T09:43:32.8504684Z * [new branch] gh/zpcore/4/base -> origin/gh/zpcore/4/base 2025-12-04T09:43:32.8506359Z * [new branch] gh/zpcore/4/head -> origin/gh/zpcore/4/head 
2025-12-04T09:43:32.8508792Z * [new branch] gh/zpcore/5/base -> origin/gh/zpcore/5/base 2025-12-04T09:43:32.8510423Z * [new branch] gh/zpcore/5/head -> origin/gh/zpcore/5/head 2025-12-04T09:43:32.8512611Z * [new branch] gh/zpcore/6/base -> origin/gh/zpcore/6/base 2025-12-04T09:43:32.8514304Z * [new branch] gh/zpcore/6/head -> origin/gh/zpcore/6/head 2025-12-04T09:43:32.8516879Z * [new branch] gh/zpcore/7/base -> origin/gh/zpcore/7/base 2025-12-04T09:43:32.8518536Z * [new branch] gh/zpcore/7/head -> origin/gh/zpcore/7/head 2025-12-04T09:43:32.8520837Z * [new branch] gh/zpcore/8/base -> origin/gh/zpcore/8/base 2025-12-04T09:43:32.8522594Z * [new branch] gh/zpcore/8/head -> origin/gh/zpcore/8/head 2025-12-04T09:43:32.8524456Z * [new branch] google-main -> origin/google-main 2025-12-04T09:43:32.8526995Z * [new branch] guangyey/external_stream -> origin/guangyey/external_stream 2025-12-04T09:43:32.8528881Z * [new branch] guangyey/test_2025 -> origin/guangyey/test_2025 2025-12-04T09:43:32.8531698Z * [new branch] guilhermeleobas/cherry-pick-55d87d9dfd9 -> origin/guilhermeleobas/cherry-pick-55d87d9dfd9 2025-12-04T09:43:32.8534381Z * [new branch] hameerabbasi/complex_tensor_subclass -> origin/hameerabbasi/complex_tensor_subclass 2025-12-04T09:43:32.8536139Z * [new branch] hameerabbasi/fix-ctensor-gradcheck-tests -> origin/hameerabbasi/fix-ctensor-gradcheck-tests 2025-12-04T09:43:32.8537775Z * [new branch] hameerabbasi/gradcheck-allclose -> origin/hameerabbasi/gradcheck-allclose 2025-12-04T09:43:32.8539417Z * [new branch] hc_baseline -> origin/hc_baseline 2025-12-04T09:43:32.8541253Z * [new branch] hhh_rand -> origin/hhh_rand 2025-12-04T09:43:32.8543550Z * [new branch] huba/f1 -> origin/huba/f1 2025-12-04T09:43:32.8545872Z * [new branch] increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test -> origin/increase-timeout-linux-jammy-cuda12_8-py3_10-gcc11-test 2025-12-04T09:43:32.8547477Z * [new branch] inlining -> origin/inlining 2025-12-04T09:43:32.8549332Z * [new branch] 
inlining-ezyang -> origin/inlining-ezyang 2025-12-04T09:43:32.8551109Z * [new branch] install-torchao-0.13.0 -> origin/install-torchao-0.13.0 2025-12-04T09:43:32.8553141Z * [new branch] instrument-trunk-pull-linux-with-job-test-filters -> origin/instrument-trunk-pull-linux-with-job-test-filters 2025-12-04T09:43:32.8554783Z * [new branch] invoke-subgraph -> origin/invoke-subgraph 2025-12-04T09:43:32.8556864Z * [new branch] issue#58739 -> origin/issue#58739 2025-12-04T09:43:32.8558712Z * [new branch] jainapurva-patch-1 -> origin/jainapurva-patch-1 2025-12-04T09:43:32.8561119Z * [new branch] jathu/o3 -> origin/jathu/o3 2025-12-04T09:43:32.8562803Z * [new branch] jathu/sve -> origin/jathu/sve 2025-12-04T09:43:32.8565290Z * [new branch] jcaip/test-cusparselt-version-0.6.2 -> origin/jcaip/test-cusparselt-version-0.6.2 2025-12-04T09:43:32.8567100Z * [new branch] jcaip/update-cusparselt-0.6.2 -> origin/jcaip/update-cusparselt-0.6.2 2025-12-04T09:43:32.8569422Z * [new branch] jiannanWang/memorysnapshot_filter -> origin/jiannanWang/memorysnapshot_filter 2025-12-04T09:43:32.8571055Z * [new branch] jiannanWang/profilerstepwarning -> origin/jiannanWang/profilerstepwarning 2025-12-04T09:43:32.8572861Z * [new branch] jithunnair-amd-patch-1 -> origin/jithunnair-amd-patch-1 2025-12-04T09:43:32.8574689Z * [new branch] jithunnair-amd-patch-10 -> origin/jithunnair-amd-patch-10 2025-12-04T09:43:32.8576496Z * [new branch] jithunnair-amd-patch-2 -> origin/jithunnair-amd-patch-2 2025-12-04T09:43:32.8578380Z * [new branch] jithunnair-amd-patch-3 -> origin/jithunnair-amd-patch-3 2025-12-04T09:43:32.8580156Z * [new branch] jithunnair-amd-patch-4 -> origin/jithunnair-amd-patch-4 2025-12-04T09:43:32.8581860Z * [new branch] jithunnair-amd-patch-5 -> origin/jithunnair-amd-patch-5 2025-12-04T09:43:32.8583754Z * [new branch] jithunnair-amd-patch-6 -> origin/jithunnair-amd-patch-6 2025-12-04T09:43:32.8585567Z * [new branch] jithunnair-amd-patch-7 -> origin/jithunnair-amd-patch-7 
2025-12-04T09:43:32.8587428Z * [new branch] jithunnair-amd-patch-8 -> origin/jithunnair-amd-patch-8 2025-12-04T09:43:32.8589253Z * [new branch] jithunnair-amd-patch-9 -> origin/jithunnair-amd-patch-9 2025-12-04T09:43:32.8591670Z * [new branch] justinchu/native-qdq -> origin/justinchu/native-qdq 2025-12-04T09:43:32.8594091Z * [new branch] kainan666/xlf_debug -> origin/kainan666/xlf_debug 2025-12-04T09:43:32.8595636Z * [new branch] kainan_test -> origin/kainan_test 2025-12-04T09:43:32.8597508Z * [new branch] larryliu0820-patch-1 -> origin/larryliu0820-patch-1 2025-12-04T09:43:32.8599894Z * [new branch] leslie/test_group_gemm_epilogues -> origin/leslie/test_group_gemm_epilogues 2025-12-04T09:43:32.8602242Z * [new branch] lessw2020/fix_cutlass_cache_error -> origin/lessw2020/fix_cutlass_cache_error 2025-12-04T09:43:32.8604482Z * [new branch] liaoxuan/shm_all_reduce -> origin/liaoxuan/shm_all_reduce 2025-12-04T09:43:32.8606229Z * [new branch] liaoxuan/test_fa_disable_softmax -> origin/liaoxuan/test_fa_disable_softmax 2025-12-04T09:43:32.8607846Z * [new branch] liaoxuan/test_int8_sdpa -> origin/liaoxuan/test_int8_sdpa 2025-12-04T09:43:32.8609549Z * [new branch] llama4-stable -> origin/llama4-stable 2025-12-04T09:43:32.8612390Z * [new branch] lts/release/1.8 -> origin/lts/release/1.8 2025-12-04T09:43:32.8614820Z * [new branch] lucaskabela/#94773 -> origin/lucaskabela/#94773 2025-12-04T09:43:32.8616411Z * [new branch] lucaskabela/fix_164876 -> origin/lucaskabela/fix_164876 2025-12-04T09:43:32.8618101Z * [new branch] lucaskabela/flop_counter -> origin/lucaskabela/flop_counter 2025-12-04T09:43:32.8619794Z * [new branch] lucaskabela/func_under_decomp -> origin/lucaskabela/func_under_decomp 2025-12-04T09:43:32.8621425Z * [new branch] lucaskabela/functional_in_dynamo -> origin/lucaskabela/functional_in_dynamo 2025-12-04T09:43:32.8623154Z * [new branch] lucaskabela/install_params_as_graph_attr -> origin/lucaskabela/install_params_as_graph_attr 2025-12-04T09:43:32.8625175Z * [new 
branch] lucaskabela/parameters_as_graph_attr -> origin/lucaskabela/parameters_as_graph_attr 2025-12-04T09:43:32.8627355Z * [new branch] lucaskabela/remove_aot_dispatcher_metadata -> origin/lucaskabela/remove_aot_dispatcher_metadata 2025-12-04T09:43:32.8629109Z * [new branch] lucaskabela/rnn_decomp -> origin/lucaskabela/rnn_decomp 2025-12-04T09:43:32.8630962Z * [new branch] lucaskabela/typing_backends -> origin/lucaskabela/typing_backends 2025-12-04T09:43:32.8632713Z * [new branch] lucaskabela/typing_ctx_manager -> origin/lucaskabela/typing_ctx_manager 2025-12-04T09:43:32.8634449Z * [new branch] lucaskabela/typing_nn_module -> origin/lucaskabela/typing_nn_module 2025-12-04T09:43:32.8636227Z * [new branch] lucaskabela/typing_user_defined -> origin/lucaskabela/typing_user_defined 2025-12-04T09:43:32.8638413Z * [new branch] lucaskabela/typing_variables -> origin/lucaskabela/typing_variables 2025-12-04T09:43:32.8640206Z * [new branch] lucaskabela/typing_variables_dicts -> origin/lucaskabela/typing_variables_dicts 2025-12-04T09:43:32.8641937Z * [new branch] lucaskabela/typing_variables_functions -> origin/lucaskabela/typing_variables_functions 2025-12-04T09:43:32.8643607Z * [new branch] lucaskabela/typing_variables_lists -> origin/lucaskabela/typing_variables_lists 2025-12-04T09:43:32.8646063Z * [new branch] lw/torch_box_by_ref -> origin/lw/torch_box_by_ref 2025-12-04T09:43:32.8647834Z * [new branch] main -> origin/main 2025-12-04T09:43:32.8649720Z * [new branch] malfet-patch-1 -> origin/malfet-patch-1 2025-12-04T09:43:32.8651582Z * [new branch] malfet-patch-2 -> origin/malfet-patch-2 2025-12-04T09:43:32.8653479Z * [new branch] malfet-patch-3 -> origin/malfet-patch-3 2025-12-04T09:43:32.8655581Z * [new branch] malfet-patch-4 -> origin/malfet-patch-4 2025-12-04T09:43:32.8657365Z * [new branch] malfet-patch-5 -> origin/malfet-patch-5 2025-12-04T09:43:32.8659224Z * [new branch] malfet-patch-6 -> origin/malfet-patch-6 2025-12-04T09:43:32.8661136Z * [new branch] 
malfet-patch-7 -> origin/malfet-patch-7 2025-12-04T09:43:32.8662966Z * [new branch] malfet-patch-8 -> origin/malfet-patch-8 2025-12-04T09:43:32.8665320Z * [new branch] malfet/add-3.14-ci -> origin/malfet/add-3.14-ci 2025-12-04T09:43:32.8667108Z * [new branch] malfet/be-do-not-make-typos-in-build-artifacts -> origin/malfet/be-do-not-make-typos-in-build-artifacts 2025-12-04T09:43:32.8668938Z * [new branch] malfet/be-move-more-settings-to-checkout-pytorch -> origin/malfet/be-move-more-settings-to-checkout-pytorch 2025-12-04T09:43:32.8670783Z * [new branch] malfet/be-remove-misisng-neon-headers -> origin/malfet/be-remove-misisng-neon-headers 2025-12-04T09:43:32.8672765Z * [new branch] malfet/mps-implement-col2im -> origin/malfet/mps-implement-col2im 2025-12-04T09:43:32.8675654Z * [new branch] manuel/aoti_metal_shimify-thread_safe -> origin/manuel/aoti_metal_shimify-thread_safe 2025-12-04T09:43:32.8677242Z * [new branch] manuel/inductor_link_openmp -> origin/manuel/inductor_link_openmp 2025-12-04T09:43:32.8679572Z * [new branch] masnesral/metaconda -> origin/masnesral/metaconda 2025-12-04T09:43:32.8681476Z * [new branch] mem_profiler_flaky_fix -> origin/mem_profiler_flaky_fix 2025-12-04T09:43:32.8683221Z * [new branch] mem_profiler_stack_trace -> origin/mem_profiler_stack_trace 2025-12-04T09:43:32.8685085Z * [new branch] memory_profiler_stack -> origin/memory_profiler_stack 2025-12-04T09:43:32.8686891Z * [new branch] metascroy-patch-1 -> origin/metascroy-patch-1 2025-12-04T09:43:32.8688682Z * [new branch] mingw_posix -> origin/mingw_posix 2025-12-04T09:43:32.8691090Z * [new branch] mlazos/S429861-debug -> origin/mlazos/S429861-debug 2025-12-04T09:43:32.8692709Z * [new branch] mlazos/aa -> origin/mlazos/aa 2025-12-04T09:43:32.8694345Z * [new branch] mlazos/acts -> origin/mlazos/acts 2025-12-04T09:43:32.8696043Z * [new branch] mlazos/arg-renames -> origin/mlazos/arg-renames 2025-12-04T09:43:32.8697692Z * [new branch] mlazos/bad-cudagraphs -> origin/mlazos/bad-cudagraphs 
2025-12-04T09:43:32.8699388Z * [new branch] mlazos/baseline-graph-breaks -> origin/mlazos/baseline-graph-breaks 2025-12-04T09:43:32.8701031Z * [new branch] mlazos/beta-tensor -> origin/mlazos/beta-tensor 2025-12-04T09:43:32.8702683Z * [new branch] mlazos/buffers -> origin/mlazos/buffers 2025-12-04T09:43:32.8704221Z * [new branch] mlazos/buffers2 -> origin/mlazos/buffers2 2025-12-04T09:43:32.8706227Z * [new branch] mlazos/buffers3 -> origin/mlazos/buffers3 2025-12-04T09:43:32.8708425Z * [new branch] mlazos/bwd -> origin/mlazos/bwd 2025-12-04T09:43:32.8710092Z * [new branch] mlazos/combo-test -> origin/mlazos/combo-test 2025-12-04T09:43:32.8711906Z * [new branch] mlazos/ctx-cleanup -> origin/mlazos/ctx-cleanup 2025-12-04T09:43:32.8713642Z * [new branch] mlazos/cuda-cmd-log -> origin/mlazos/cuda-cmd-log 2025-12-04T09:43:32.8715526Z * [new branch] mlazos/cudagraph-tests -> origin/mlazos/cudagraph-tests 2025-12-04T09:43:32.8717302Z * [new branch] mlazos/cudagraphs-measurement -> origin/mlazos/cudagraphs-measurement 2025-12-04T09:43:32.8719085Z * [new branch] mlazos/cutlass-test -> origin/mlazos/cutlass-test 2025-12-04T09:43:32.8720931Z * [new branch] mlazos/cutlass-topo-bug -> origin/mlazos/cutlass-topo-bug 2025-12-04T09:43:32.8722555Z * [new branch] mlazos/dataclass-proxy -> origin/mlazos/dataclass-proxy 2025-12-04T09:43:32.8724277Z * [new branch] mlazos/dc-attrs -> origin/mlazos/dc-attrs 2025-12-04T09:43:32.8726031Z * [new branch] mlazos/dc-helion -> origin/mlazos/dc-helion 2025-12-04T09:43:32.8727842Z * [new branch] mlazos/dict-fix -> origin/mlazos/dict-fix 2025-12-04T09:43:32.8729588Z * [new branch] mlazos/disable-tf -> origin/mlazos/disable-tf 2025-12-04T09:43:32.8731294Z * [new branch] mlazos/dupe-fix -> origin/mlazos/dupe-fix 2025-12-04T09:43:32.8733180Z * [new branch] mlazos/dyn-batch -> origin/mlazos/dyn-batch 2025-12-04T09:43:32.8734846Z * [new branch] mlazos/evt -> origin/mlazos/evt 2025-12-04T09:43:32.8736694Z * [new branch] mlazos/extract-examples -> 
origin/mlazos/extract-examples 2025-12-04T09:43:32.8738409Z * [new branch] mlazos/foreach-op -> origin/mlazos/foreach-op 2025-12-04T09:43:32.8740291Z * [new branch] mlazos/fp8 -> origin/mlazos/fp8 2025-12-04T09:43:32.8742033Z * [new branch] mlazos/fp8-bias -> origin/mlazos/fp8-bias 2025-12-04T09:43:32.8743821Z * [new branch] mlazos/fp8-bias-fusion -> origin/mlazos/fp8-bias-fusion 2025-12-04T09:43:32.8745554Z * [new branch] mlazos/fp8-fixes -> origin/mlazos/fp8-fixes 2025-12-04T09:43:32.8747365Z * [new branch] mlazos/freezing -> origin/mlazos/freezing 2025-12-04T09:43:32.8749222Z * [new branch] mlazos/h-comp -> origin/mlazos/h-comp 2025-12-04T09:43:32.8750984Z * [new branch] mlazos/h-comp2 -> origin/mlazos/h-comp2 2025-12-04T09:43:32.8752748Z * [new branch] mlazos/hash-hop -> origin/mlazos/hash-hop 2025-12-04T09:43:32.8754497Z * [new branch] mlazos/hc -> origin/mlazos/hc 2025-12-04T09:43:32.8756483Z * [new branch] mlazos/hc-cycles -> origin/mlazos/hc-cycles 2025-12-04T09:43:32.8758239Z * [new branch] mlazos/hc-fixes -> origin/mlazos/hc-fixes 2025-12-04T09:43:32.8760062Z * [new branch] mlazos/hc-fixes3 -> origin/mlazos/hc-fixes3 2025-12-04T09:43:32.8761785Z * [new branch] mlazos/hc-fixes4 -> origin/mlazos/hc-fixes4 2025-12-04T09:43:32.8764024Z * [new branch] mlazos/hc-hf -> origin/mlazos/hc-hf 2025-12-04T09:43:32.8765818Z * [new branch] mlazos/hc-mut -> origin/mlazos/hc-mut 2025-12-04T09:43:32.8767596Z * [new branch] mlazos/hc10 -> origin/mlazos/hc10 2025-12-04T09:43:32.8769447Z * [new branch] mlazos/hc11 -> origin/mlazos/hc11 2025-12-04T09:43:32.8771231Z * [new branch] mlazos/hc12 -> origin/mlazos/hc12 2025-12-04T09:43:32.8772964Z * [new branch] mlazos/hc13 -> origin/mlazos/hc13 2025-12-04T09:43:32.8774711Z * [new branch] mlazos/hc14 -> origin/mlazos/hc14 2025-12-04T09:43:32.8776451Z * [new branch] mlazos/hc15 -> origin/mlazos/hc15 2025-12-04T09:43:32.8778237Z * [new branch] mlazos/hc2 -> origin/mlazos/hc2 2025-12-04T09:43:32.8780049Z * [new branch] mlazos/hc4 -> 
origin/mlazos/hc4 2025-12-04T09:43:32.8781806Z * [new branch] mlazos/hc5 -> origin/mlazos/hc5 2025-12-04T09:43:32.8783535Z * [new branch] mlazos/hc6 -> origin/mlazos/hc6 2025-12-04T09:43:32.8785226Z * [new branch] mlazos/hc7 -> origin/mlazos/hc7 2025-12-04T09:43:32.8787051Z * [new branch] mlazos/hc8 -> origin/mlazos/hc8 2025-12-04T09:43:32.8788811Z * [new branch] mlazos/hc9 -> origin/mlazos/hc9 2025-12-04T09:43:32.8790587Z * [new branch] mlazos/hc_baseline2 -> origin/mlazos/hc_baseline2 2025-12-04T09:43:32.8792238Z * [new branch] mlazos/inductor-streams -> origin/mlazos/inductor-streams 2025-12-04T09:43:32.8793816Z * [new branch] mlazos/main -> origin/mlazos/main 2025-12-04T09:43:32.8795604Z * [new branch] mlazos/mcg2 -> origin/mlazos/mcg2 2025-12-04T09:43:32.8797445Z * [new branch] mlazos/meta-guards -> origin/mlazos/meta-guards 2025-12-04T09:43:32.8799862Z * [new branch] mlazos/mlazos/foreach-map-adam -> origin/mlazos/mlazos/foreach-map-adam 2025-12-04T09:43:32.8801650Z * [new branch] mlazos/mlazos/tf-mode-backup -> origin/mlazos/mlazos/tf-mode-backup 2025-12-04T09:43:32.8803336Z * [new branch] mlazos/mod-fix -> origin/mlazos/mod-fix 2025-12-04T09:43:32.8805652Z * [new branch] mlazos/mode-fix -> origin/mlazos/mode-fix 2025-12-04T09:43:32.8807413Z * [new branch] mlazos/offsets -> origin/mlazos/offsets 2025-12-04T09:43:32.8809087Z * [new branch] mlazos/overguarding -> origin/mlazos/overguarding 2025-12-04T09:43:32.8810890Z * [new branch] mlazos/proxy-ctors -> origin/mlazos/proxy-ctors 2025-12-04T09:43:32.8812652Z * [new branch] mlazos/quant-fix -> origin/mlazos/quant-fix 2025-12-04T09:43:32.8814416Z * [new branch] mlazos/resnet-fix -> origin/mlazos/resnet-fix 2025-12-04T09:43:32.8816171Z * [new branch] mlazos/rm-buf-names -> origin/mlazos/rm-buf-names 2025-12-04T09:43:32.8817927Z * [new branch] mlazos/rm-code -> origin/mlazos/rm-code 2025-12-04T09:43:32.8819702Z * [new branch] mlazos/rm-spam -> origin/mlazos/rm-spam 2025-12-04T09:43:32.8821543Z * [new branch] 
mlazos/rtp -> origin/mlazos/rtp 2025-12-04T09:43:32.8823300Z * [new branch] mlazos/static-idx-dbg -> origin/mlazos/static-idx-dbg 2025-12-04T09:43:32.8825187Z * [new branch] mlazos/static-inputs-log -> origin/mlazos/static-inputs-log 2025-12-04T09:43:32.8826786Z * [new branch] mlazos/stests -> origin/mlazos/stests 2025-12-04T09:43:32.8828695Z * [new branch] mlazos/stream-ops -> origin/mlazos/stream-ops 2025-12-04T09:43:32.8830431Z * [new branch] mlazos/td-fix2 -> origin/mlazos/td-fix2 2025-12-04T09:43:32.8832195Z * [new branch] mlazos/tensor-hasattr2 -> origin/mlazos/tensor-hasattr2 2025-12-04T09:43:32.8834009Z * [new branch] mlazos/test -> origin/mlazos/test 2025-12-04T09:43:32.8835723Z * [new branch] mlazos/tf-mode -> origin/mlazos/tf-mode 2025-12-04T09:43:32.8837500Z * [new branch] mlazos/tf-mode-backup2 -> origin/mlazos/tf-mode-backup2 2025-12-04T09:43:32.8839327Z * [new branch] mlazos/tf-mode-reland -> origin/mlazos/tf-mode-reland 2025-12-04T09:43:32.8841141Z * [new branch] mlazos/tf-mode-reland2 -> origin/mlazos/tf-mode-reland2 2025-12-04T09:43:32.8842915Z * [new branch] mlazos/tf-mode-reland3 -> origin/mlazos/tf-mode-reland3 2025-12-04T09:43:32.8844679Z * [new branch] mlazos/triton-no-epi -> origin/mlazos/triton-no-epi 2025-12-04T09:43:32.8846493Z * [new branch] mlazos/tune-proto -> origin/mlazos/tune-proto 2025-12-04T09:43:32.8848236Z * [new branch] mlazos/tuple-fixes -> origin/mlazos/tuple-fixes 2025-12-04T09:43:32.8850085Z * [new branch] mlazos/tuple-fixes2 -> origin/mlazos/tuple-fixes2 2025-12-04T09:43:32.8851852Z * [new branch] mlazos/tuple-handling -> origin/mlazos/tuple-handling 2025-12-04T09:43:32.8853723Z * [new branch] mlazos/user-stream-base -> origin/mlazos/user-stream-base 2025-12-04T09:43:32.8855562Z * [new branch] mlazos/user-streams -> origin/mlazos/user-streams 2025-12-04T09:43:32.8858716Z * [new branch] mlazos/user-streams-backup -> origin/mlazos/user-streams-backup 2025-12-04T09:43:32.8860516Z * [new branch] mlazos/user-streams-backup2 -> 
origin/mlazos/user-streams-backup2 2025-12-04T09:43:32.8862244Z * [new branch] mlazos/vary-beta -> origin/mlazos/vary-beta 2025-12-04T09:43:32.8863997Z * [new branch] mlazos/vary-beta2 -> origin/mlazos/vary-beta2 2025-12-04T09:43:32.8865756Z * [new branch] mlazos/weird-perf1 -> origin/mlazos/weird-perf1 2025-12-04T09:43:32.8867629Z * [new branch] mm_out_dtype_compile -> origin/mm_out_dtype_compile 2025-12-04T09:43:32.8869480Z * [new branch] module-shim -> origin/module-shim 2025-12-04T09:43:32.8871263Z * [new branch] move_config -> origin/move_config 2025-12-04T09:43:32.8873586Z * [new branch] msaroufim/reduce -> origin/msaroufim/reduce 2025-12-04T09:43:32.8875902Z * [new branch] mtia/basic-cmake -> origin/mtia/basic-cmake 2025-12-04T09:43:32.8878324Z * [new branch] mwizak/fix-triton-block-shape -> origin/mwizak/fix-triton-block-shape 2025-12-04T09:43:32.8880114Z * [new branch] my_varlen_backup -> origin/my_varlen_backup 2025-12-04T09:43:32.8881903Z * [new branch] nativert_num_outputs -> origin/nativert_num_outputs 2025-12-04T09:43:32.8884027Z * [new branch] new-codegen -> origin/new-codegen 2025-12-04T09:43:32.8885866Z * [new branch] newtest-base -> origin/newtest-base 2025-12-04T09:43:32.8888152Z * [new branch] ngimel/addmm_dtype -> origin/ngimel/addmm_dtype 2025-12-04T09:43:32.8889770Z * [new branch] ngimel/div_inv -> origin/ngimel/div_inv 2025-12-04T09:43:32.8891515Z * [new branch] ngimel/error_index_list -> origin/ngimel/error_index_list 2025-12-04T09:43:32.8893449Z * [new branch] ngimel/gather_grid -> origin/ngimel/gather_grid 2025-12-04T09:43:32.8895150Z * [new branch] ngimel/gather_grid_release -> origin/ngimel/gather_grid_release 2025-12-04T09:43:32.8896752Z * [new branch] ngimel/gg_new -> origin/ngimel/gg_new 2025-12-04T09:43:32.8898410Z * [new branch] ngimel/hostalloc -> origin/ngimel/hostalloc 2025-12-04T09:43:32.8900034Z * [new branch] ngimel/storage_id -> origin/ngimel/storage_id 2025-12-04T09:43:32.8901875Z * [new branch] nightly -> origin/nightly 
2025-12-04T09:43:32.8904326Z * [new branch] nikitaved/addmm_1_rowcol_lt_path_check -> origin/nikitaved/addmm_1_rowcol_lt_path_check 2025-12-04T09:43:32.8906074Z * [new branch] nikitaved/addmm_epilogue_fusions_2d_bias -> origin/nikitaved/addmm_epilogue_fusions_2d_bias 2025-12-04T09:43:32.8907796Z * [new branch] nikitaved/addmm_epilogue_fusions_inductor -> origin/nikitaved/addmm_epilogue_fusions_inductor 2025-12-04T09:43:32.8909741Z * [new branch] nikitaved/addmm_epilogue_fusions_scratch -> origin/nikitaved/addmm_epilogue_fusions_scratch 2025-12-04T09:43:32.8911751Z * [new branch] nikitaved/grad_addmm_epilogue_fusions -> origin/nikitaved/grad_addmm_epilogue_fusions 2025-12-04T09:43:32.8913826Z * [new branch] nikitaved/simpler_can_use_32bit_index -> origin/nikitaved/simpler_can_use_32bit_index 2025-12-04T09:43:32.8915651Z * [new branch] nikitaved/test -> origin/nikitaved/test 2025-12-04T09:43:32.8918051Z * [new branch] nmacchioni-perf-test-async-autotune -> origin/nmacchioni-perf-test-async-autotune 2025-12-04T09:43:32.8919761Z * [new branch] no_distributed_log_spew -> origin/no_distributed_log_spew 2025-12-04T09:43:32.8921574Z * [new branch] nofun-hack -> origin/nofun-hack 2025-12-04T09:43:32.8923352Z * [new branch] norm_bench -> origin/norm_bench 2025-12-04T09:43:32.8926196Z * [new branch] nullplay/fuse_matmul -> origin/nullplay/fuse_matmul 2025-12-04T09:43:32.8927921Z * [new branch] nullplay_fuse_matmul -> origin/nullplay_fuse_matmul 2025-12-04T09:43:32.8929743Z * [new branch] optimizer_test -> origin/optimizer_test 2025-12-04T09:43:32.8932679Z * [new branch] orig/release/1.10 -> origin/orig/release/1.10 2025-12-04T09:43:32.8934420Z * [new branch] orig/release/1.11 -> origin/orig/release/1.11 2025-12-04T09:43:32.8936151Z * [new branch] orig/release/1.12 -> origin/orig/release/1.12 2025-12-04T09:43:32.8938086Z * [new branch] orig/release/1.13 -> origin/orig/release/1.13 2025-12-04T09:43:32.8939862Z * [new branch] orig/release/1.6 -> origin/orig/release/1.6 
2025-12-04T09:43:32.8941705Z * [new branch] orig/release/1.7 -> origin/orig/release/1.7 2025-12-04T09:43:32.8943512Z * [new branch] orig/release/1.8 -> origin/orig/release/1.8 2025-12-04T09:43:32.8945416Z * [new branch] orig/release/1.9 -> origin/orig/release/1.9 2025-12-04T09:43:32.8947100Z * [new branch] orig/release/2.0 -> origin/orig/release/2.0 2025-12-04T09:43:32.8948950Z * [new branch] orig/release/2.1 -> origin/orig/release/2.1 2025-12-04T09:43:32.8950658Z * [new branch] orig/release/2.2 -> origin/orig/release/2.2 2025-12-04T09:43:32.8952529Z * [new branch] orig/release/2.3 -> origin/orig/release/2.3 2025-12-04T09:43:32.8954207Z * [new branch] orig/release/2.4 -> origin/orig/release/2.4 2025-12-04T09:43:32.8955880Z * [new branch] orig/release/2.5 -> origin/orig/release/2.5 2025-12-04T09:43:32.8957773Z * [new branch] orig/release/2.6 -> origin/orig/release/2.6 2025-12-04T09:43:32.8959879Z * [new branch] orig/release/2.7 -> origin/orig/release/2.7 2025-12-04T09:43:32.8962158Z * [new branch] orig/release/2.8 -> origin/orig/release/2.8 2025-12-04T09:43:32.8963861Z * [new branch] orig/release/2.9 -> origin/orig/release/2.9 2025-12-04T09:43:32.8967581Z * [new branch] origin/gh/fxdawnn/1/base -> origin/origin/gh/fxdawnn/1/base 2025-12-04T09:43:32.8969238Z * [new branch] origin/gh/fxdawnn/1/orig -> origin/origin/gh/fxdawnn/1/orig 2025-12-04T09:43:32.8972070Z * [new branch] origin/gh/zpcore/14/orig -> origin/origin/gh/zpcore/14/orig 2025-12-04T09:43:32.8973909Z * [new branch] oulgen-patch-1 -> origin/oulgen-patch-1 2025-12-04T09:43:32.8975774Z * [new branch] oulgen-patch-2 -> origin/oulgen-patch-2 2025-12-04T09:43:32.8977601Z * [new branch] oulgen-patch-3 -> origin/oulgen-patch-3 2025-12-04T09:43:32.8979428Z * [new branch] oulgen-patch-4 -> origin/oulgen-patch-4 2025-12-04T09:43:32.8981244Z * [new branch] padded-tensor -> origin/padded-tensor 2025-12-04T09:43:32.8983124Z * [new branch] pca2 -> origin/pca2 2025-12-04T09:43:32.8985092Z * [new branch] 
per_channel_backup -> origin/per_channel_backup 2025-12-04T09:43:32.8987042Z * [new branch] perf_ops -> origin/perf_ops 2025-12-04T09:43:32.8988885Z * [new branch] perf_ops_2_9 -> origin/perf_ops_2_9 2025-12-04T09:43:32.8990798Z * [new branch] pianpwk-patch-1 -> origin/pianpwk-patch-1 2025-12-04T09:43:32.8993126Z * [new branch] pianpwk/__draft_debug_mode -> origin/pianpwk/__draft_debug_mode 2025-12-04T09:43:32.8994817Z * [new branch] pianpwk/_debug_mode_for_triton_draft -> origin/pianpwk/_debug_mode_for_triton_draft 2025-12-04T09:43:32.8996385Z * [new branch] pianpwk/_debug_nn_module_compile -> origin/pianpwk/_debug_nn_module_compile 2025-12-04T09:43:32.8998048Z * [new branch] pianpwk/_draft_triton_11_3 -> origin/pianpwk/_draft_triton_11_3 2025-12-04T09:43:32.9000044Z * [new branch] pianpwk/_manual_bucket_draft -> origin/pianpwk/_manual_bucket_draft 2025-12-04T09:43:32.9002015Z * [new branch] pianpwk/_profile_w_dispatch_keys -> origin/pianpwk/_profile_w_dispatch_keys 2025-12-04T09:43:32.9004024Z * [new branch] pianpwk/_super_draft_debug_mode -> origin/pianpwk/_super_draft_debug_mode 2025-12-04T09:43:32.9005993Z * [new branch] pianpwk/_unbacked_local_shard_size -> origin/pianpwk/_unbacked_local_shard_size 2025-12-04T09:43:32.9007689Z * [new branch] pianpwk/anomaly_tb -> origin/pianpwk/anomaly_tb 2025-12-04T09:43:32.9009413Z * [new branch] pianpwk/auto_fx_annotate -> origin/pianpwk/auto_fx_annotate 2025-12-04T09:43:32.9011303Z * [new branch] pianpwk/backed_size_oblivious_export -> origin/pianpwk/backed_size_oblivious_export 2025-12-04T09:43:32.9012976Z * [new branch] pianpwk/bert_dynamic_perf -> origin/pianpwk/bert_dynamic_perf 2025-12-04T09:43:32.9014769Z * [new branch] pianpwk/debug_fwd_stack_traces -> origin/pianpwk/debug_fwd_stack_traces 2025-12-04T09:43:32.9016569Z * [new branch] pianpwk/debug_hash_tensor -> origin/pianpwk/debug_hash_tensor 2025-12-04T09:43:32.9018305Z * [new branch] pianpwk/debug_mode_annotate -> origin/pianpwk/debug_mode_annotate 
2025-12-04T09:43:32.9020006Z * [new branch] pianpwk/debug_mode_defaults -> origin/pianpwk/debug_mode_defaults 2025-12-04T09:43:32.9021642Z * [new branch] pianpwk/debug_mode_hacks -> origin/pianpwk/debug_mode_hacks 2025-12-04T09:43:32.9023436Z * [new branch] pianpwk/debug_mode_opcall_refactor -> origin/pianpwk/debug_mode_opcall_refactor 2025-12-04T09:43:32.9025107Z * [new branch] pianpwk/debug_mode_show_ids -> origin/pianpwk/debug_mode_show_ids 2025-12-04T09:43:32.9026822Z * [new branch] pianpwk/debug_mode_triton -> origin/pianpwk/debug_mode_triton 2025-12-04T09:43:32.9028807Z * [new branch] pianpwk/debug_show_stack_trace -> origin/pianpwk/debug_show_stack_trace 2025-12-04T09:43:32.9030567Z * [new branch] pianpwk/debug_wait_on_collective -> origin/pianpwk/debug_wait_on_collective 2025-12-04T09:43:32.9032377Z * [new branch] pianpwk/debugmode_compile_tf -> origin/pianpwk/debugmode_compile_tf 2025-12-04T09:43:32.9034169Z * [new branch] pianpwk/dispatch_key_debugging_for_debug -> origin/pianpwk/dispatch_key_debugging_for_debug 2025-12-04T09:43:32.9035872Z * [new branch] pianpwk/draft_debug_mode_tfcompile -> origin/pianpwk/draft_debug_mode_tfcompile 2025-12-04T09:43:32.9037616Z * [new branch] pianpwk/draft_multikernel_nn -> origin/pianpwk/draft_multikernel_nn 2025-12-04T09:43:32.9039960Z * [new branch] pianpwk/draft_multikernel_status_10_5 -> origin/pianpwk/draft_multikernel_status_10_5 2025-12-04T09:43:32.9041739Z * [new branch] pianpwk/dtensor_custom_chunk -> origin/pianpwk/dtensor_custom_chunk 2025-12-04T09:43:32.9043571Z * [new branch] pianpwk/dtensor_unbacked_keypath -> origin/pianpwk/dtensor_unbacked_keypath 2025-12-04T09:43:32.9045477Z * [new branch] pianpwk/event_list_tree -> origin/pianpwk/event_list_tree 2025-12-04T09:43:32.9047132Z * [new branch] pianpwk/false_numel_refs -> origin/pianpwk/false_numel_refs 2025-12-04T09:43:32.9048840Z * [new branch] pianpwk/maybe_guard_rel -> origin/pianpwk/maybe_guard_rel 2025-12-04T09:43:32.9050579Z * [new branch] 
pianpwk/multikernel_hints_draft -> origin/pianpwk/multikernel_hints_draft 2025-12-04T09:43:32.9052511Z * [new branch] pianpwk/no_size_oblivious_slice_scat -> origin/pianpwk/no_size_oblivious_slice_scat 2025-12-04T09:43:32.9054247Z * [new branch] pianpwk/oblivious_reshape_view_better -> origin/pianpwk/oblivious_reshape_view_better 2025-12-04T09:43:32.9056108Z * [new branch] pianpwk/pre_forward_hook -> origin/pianpwk/pre_forward_hook 2025-12-04T09:43:32.9057925Z * [new branch] pianpwk/skip_python_keys_alternate -> origin/pianpwk/skip_python_keys_alternate 2025-12-04T09:43:32.9059675Z * [new branch] pianpwk/skip_python_keys_in_guards -> origin/pianpwk/skip_python_keys_in_guards 2025-12-04T09:43:32.9061356Z * [new branch] pianpwk/sym_tokens_draft -> origin/pianpwk/sym_tokens_draft 2025-12-04T09:43:32.9064044Z * [new branch] pianpwk/symint_one_hot -> origin/pianpwk/symint_one_hot 2025-12-04T09:43:32.9065485Z * [new branch] pianpwk/test_pointwise_guard_or_false -> origin/pianpwk/test_pointwise_guard_or_false 2025-12-04T09:43:32.9066616Z * [new branch] pianpwk/totally_draft_sym_wrap -> origin/pianpwk/totally_draft_sym_wrap 2025-12-04T09:43:32.9068653Z * [new branch] pianpwk/try_dumb_stuff -> origin/pianpwk/try_dumb_stuff 2025-12-04T09:43:32.9070366Z * [new branch] pianpwk/try_dumb_stuff_2 -> origin/pianpwk/try_dumb_stuff_2 2025-12-04T09:43:32.9072169Z * [new branch] pianpwk/unbacked_dtensor_mm -> origin/pianpwk/unbacked_dtensor_mm 2025-12-04T09:43:32.9073889Z * [new branch] pianpwk/unbacked_tracing_12_2 -> origin/pianpwk/unbacked_tracing_12_2 2025-12-04T09:43:32.9075610Z * [new branch] pianpwk/user_symints -> origin/pianpwk/user_symints 2025-12-04T09:43:32.9077254Z * [new branch] pianpwk/wan21_reshape -> origin/pianpwk/wan21_reshape 2025-12-04T09:43:32.9079621Z * [new branch] piz/fix_partial_backward_1112 -> origin/piz/fix_partial_backward_1112 2025-12-04T09:43:32.9081220Z * [new branch] piz/prop_cache_clean -> origin/piz/prop_cache_clean 2025-12-04T09:43:32.9083110Z * 
[new branch] pool-separate -> origin/pool-separate 2025-12-04T09:43:32.9084802Z * [new branch] pr-156087 -> origin/pr-156087 2025-12-04T09:43:32.9087221Z * [new branch] pr/131860 -> origin/pr/131860 2025-12-04T09:43:32.9088987Z * [new branch] predispatch_to -> origin/predispatch_to 2025-12-04T09:43:32.9090819Z * [new branch] protect-c17 -> origin/protect-c17 2025-12-04T09:43:32.9092547Z * [new branch] pt-opt-cuda3 -> origin/pt-opt-cuda3 2025-12-04T09:43:32.9094892Z * [new branch] python_compiled_autograd -> origin/python_compiled_autograd 2025-12-04T09:43:32.9097496Z * [new branch] q1l1/fix_device_moved_constant_type_unknown -> origin/q1l1/fix_device_moved_constant_type_unknown 2025-12-04T09:43:32.9099192Z * [new branch] q1l1/fix_wrong_default_type_for_kernel_call_args -> origin/q1l1/fix_wrong_default_type_for_kernel_call_args 2025-12-04T09:43:32.9101657Z * [new branch] qchip/export-D54134695 -> origin/qchip/export-D54134695 2025-12-04T09:43:32.9103615Z * [new branch] quote-pytest_cache -> origin/quote-pytest_cache 2025-12-04T09:43:32.9105746Z * [new branch] reland-accgrad-stream-warn -> origin/reland-accgrad-stream-warn 2025-12-04T09:43:32.9108369Z * [new branch] release/1.10 -> origin/release/1.10 2025-12-04T09:43:32.9110067Z * [new branch] release/1.11 -> origin/release/1.11 2025-12-04T09:43:32.9111802Z * [new branch] release/1.12 -> origin/release/1.12 2025-12-04T09:43:32.9113512Z * [new branch] release/1.13 -> origin/release/1.13 2025-12-04T09:43:32.9115155Z * [new branch] release/1.4 -> origin/release/1.4 2025-12-04T09:43:32.9116731Z * [new branch] release/1.4.1 -> origin/release/1.4.1 2025-12-04T09:43:32.9118413Z * [new branch] release/1.5 -> origin/release/1.5 2025-12-04T09:43:32.9120172Z * [new branch] release/1.6 -> origin/release/1.6 2025-12-04T09:43:32.9121887Z * [new branch] release/1.7 -> origin/release/1.7 2025-12-04T09:43:32.9123678Z * [new branch] release/1.8 -> origin/release/1.8 2025-12-04T09:43:32.9125871Z * [new branch] release/1.9 -> 
origin/release/1.9 2025-12-04T09:43:32.9127595Z * [new branch] release/2.0 -> origin/release/2.0 2025-12-04T09:43:32.9129399Z * [new branch] release/2.1 -> origin/release/2.1 2025-12-04T09:43:32.9131192Z * [new branch] release/2.2 -> origin/release/2.2 2025-12-04T09:43:32.9133305Z * [new branch] release/2.3 -> origin/release/2.3 2025-12-04T09:43:32.9135424Z * [new branch] release/2.4 -> origin/release/2.4 2025-12-04T09:43:32.9137652Z * [new branch] release/2.5 -> origin/release/2.5 2025-12-04T09:43:32.9139471Z * [new branch] release/2.6 -> origin/release/2.6 2025-12-04T09:43:32.9141362Z * [new branch] release/2.7 -> origin/release/2.7 2025-12-04T09:43:32.9143172Z * [new branch] release/2.8 -> origin/release/2.8 2025-12-04T09:43:32.9145167Z * [new branch] release/2.9 -> origin/release/2.9 2025-12-04T09:43:32.9147002Z * [new branch] release_notes -> origin/release_notes 2025-12-04T09:43:32.9148950Z * [new branch] remove_pyinterpreter -> origin/remove_pyinterpreter 2025-12-04T09:43:32.9151077Z * [new branch] replace-pytorch-labs-20250812-195836 -> origin/replace-pytorch-labs-20250812-195836 2025-12-04T09:43:32.9152752Z * [new branch] replace-pytorch-labs-20250812-200248 -> origin/replace-pytorch-labs-20250812-200248 2025-12-04T09:43:32.9154360Z * [new branch] replace-pytorch-labs-20250812-200324 -> origin/replace-pytorch-labs-20250812-200324 2025-12-04T09:43:32.9156422Z * [new branch] replace-pytorch-labs-20250812-204020 -> origin/replace-pytorch-labs-20250812-204020 2025-12-04T09:43:32.9159711Z * [new branch] revert-131069-gh/krzysztofjordan/1/head -> origin/revert-131069-gh/krzysztofjordan/1/head 2025-12-04T09:43:32.9162913Z * [new branch] revert-131469-gh/andrewor14/51/head -> origin/revert-131469-gh/andrewor14/51/head 2025-12-04T09:43:32.9166187Z * [new branch] revert-152361-gh/fadara01/1/head -> origin/revert-152361-gh/fadara01/1/head 2025-12-04T09:43:32.9169563Z * [new branch] revert-156870-gh/skarjala/3/head -> origin/revert-156870-gh/skarjala/3/head 
2025-12-04T09:43:32.9171733Z * [new branch] revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ -> origin/revert-157914-cherry-pick-157503-by-pytorch_bot_bot_ 2025-12-04T09:43:32.9173386Z * [new branch] revert-hoo-invoke-subgraph -> origin/revert-hoo-invoke-subgraph 2025-12-04T09:43:32.9175232Z * [new branch] revert_always_build_distributed -> origin/revert_always_build_distributed 2025-12-04T09:43:32.9176974Z * [new branch] rms_norm_patch -> origin/rms_norm_patch 2025-12-04T09:43:32.9179538Z * [new branch] ruisi/fix_all_to_all_estimation -> origin/ruisi/fix_all_to_all_estimation 2025-12-04T09:43:32.9181034Z * [new branch] ruisi/fix_comm_estimation -> origin/ruisi/fix_comm_estimation 2025-12-04T09:43:32.9182660Z * [new branch] ruisi/fix_dynamic_shape_estimation -> origin/ruisi/fix_dynamic_shape_estimation 2025-12-04T09:43:32.9184290Z * [new branch] ruisi/fix_llama3_autobucketing -> origin/ruisi/fix_llama3_autobucketing 2025-12-04T09:43:32.9186280Z * [new branch] ruisi/fix_manual_bucketing_ep_pass -> origin/ruisi/fix_manual_bucketing_ep_pass 2025-12-04T09:43:32.9188365Z * [new branch] ruisi/manual_bucket_pass -> origin/ruisi/manual_bucket_pass 2025-12-04T09:43:32.9190903Z * [new branch] ryanguo99/cleanup-dynamo-expected-failures -> origin/ryanguo99/cleanup-dynamo-expected-failures 2025-12-04T09:43:32.9192367Z * [new branch] ryanguo99/fix-closure-var -> origin/ryanguo99/fix-closure-var 2025-12-04T09:43:32.9194679Z * [new branch] rzou/faketensor_bench -> origin/rzou/faketensor_bench 2025-12-04T09:43:32.9196288Z * [new branch] rzou/njt -> origin/rzou/njt 2025-12-04T09:43:32.9197979Z * [new branch] rzou/pca -> origin/rzou/pca 2025-12-04T09:43:32.9199613Z * [new branch] rzou/realprop -> origin/rzou/realprop 2025-12-04T09:43:32.9201467Z * [new branch] samplevllm -> origin/samplevllm 2025-12-04T09:43:32.9204104Z * [new branch] sanchitintel/weird_thing_with_test_cpu_select_algorithm -> origin/sanchitintel/weird_thing_with_test_cpu_select_algorithm 
2025-12-04T09:43:32.9205872Z * [new branch] sapling-pr-archive-SS-JIA -> origin/sapling-pr-archive-SS-JIA 2025-12-04T09:43:32.9207762Z * [new branch] sapling-pr-archive-tushar00jain -> origin/sapling-pr-archive-tushar00jain 2025-12-04T09:43:32.9209443Z * [new branch] save -> origin/save 2025-12-04T09:43:32.9211235Z * [new branch] scaled_mm -> origin/scaled_mm 2025-12-04T09:43:32.9213010Z * [new branch] scan_attempt -> origin/scan_attempt 2025-12-04T09:43:32.9215473Z * [new branch] sdym/2.5.1 -> origin/sdym/2.5.1 2025-12-04T09:43:32.9217403Z * [new branch] sekyondaMeta-dynamoconfig-fix -> origin/sekyondaMeta-dynamoconfig-fix 2025-12-04T09:43:32.9219673Z * [new branch] shengf/fx-xform-perf -> origin/shengf/fx-xform-perf 2025-12-04T09:43:32.9221561Z * [new branch] shoumikhin-patch-1 -> origin/shoumikhin-patch-1 2025-12-04T09:43:32.9223347Z * [new branch] solve-accuracy-fix -> origin/solve-accuracy-fix 2025-12-04T09:43:32.9225146Z * [new branch] some_rocm_inductor_skips -> origin/some_rocm_inductor_skips 2025-12-04T09:43:32.9227595Z * [new branch] soulitzer/stash-tls-ac -> origin/soulitzer/stash-tls-ac 2025-12-04T09:43:32.9229435Z * [new branch] sparse-mm-bf16-support -> origin/sparse-mm-bf16-support 2025-12-04T09:43:32.9231218Z * [new branch] starterTaskUpdate -> origin/starterTaskUpdate 2025-12-04T09:43:32.9232990Z * [new branch] suo -> origin/suo 2025-12-04T09:43:32.9234796Z * [new branch] sve-poc -> origin/sve-poc 2025-12-04T09:43:32.9236583Z * [new branch] switch-bn -> origin/switch-bn 2025-12-04T09:43:32.9238389Z * [new branch] sy_annotation_in_autograd_hop -> origin/sy_annotation_in_autograd_hop 2025-12-04T09:43:32.9240143Z * [new branch] sy_aot_eager_record -> origin/sy_aot_eager_record 2025-12-04T09:43:32.9241991Z * [new branch] sy_custom_bucketing -> origin/sy_custom_bucketing 2025-12-04T09:43:32.9243919Z * [new branch] sy_debug_mode_test -> origin/sy_debug_mode_test 2025-12-04T09:43:32.9245664Z * [new branch] sy_deserialize -> origin/sy_deserialize 
2025-12-04T09:43:32.9247376Z * [new branch] sy_dump_gm_code -> origin/sy_dump_gm_code 2025-12-04T09:43:32.9249132Z * [new branch] sy_exp -> origin/sy_exp 2025-12-04T09:43:32.9250996Z * [new branch] sy_export_annotation -> origin/sy_export_annotation 2025-12-04T09:43:32.9253274Z * [new branch] sy_invoke_subgraph -> origin/sy_invoke_subgraph 2025-12-04T09:43:32.9255003Z * [new branch] sy_kernel_bw_name -> origin/sy_kernel_bw_name 2025-12-04T09:43:32.9257072Z * [new branch] sy_multi_arch -> origin/sy_multi_arch 2025-12-04T09:43:32.9258826Z * [new branch] sy_nn_module_stack -> origin/sy_nn_module_stack 2025-12-04T09:43:32.9260677Z * [new branch] sy_original_dtensor -> origin/sy_original_dtensor 2025-12-04T09:43:32.9262401Z * [new branch] sy_profiler_cia -> origin/sy_profiler_cia 2025-12-04T09:43:32.9264568Z * [new branch] symm_mem_sync -> origin/symm_mem_sync 2025-12-04T09:43:32.9266430Z * [new branch] sympy-bottleneck-repro -> origin/sympy-bottleneck-repro 2025-12-04T09:43:32.9268439Z * [new branch] tensordict_integration -> origin/tensordict_integration 2025-12-04T09:43:32.9270264Z * [new branch] test-move-conda-builds -> origin/test-move-conda-builds 2025-12-04T09:43:32.9272154Z * [new branch] test-old -> origin/test-old 2025-12-04T09:43:32.9274504Z * [new branch] test/bmm_heur -> origin/test/bmm_heur 2025-12-04T09:43:32.9276878Z * [new branch] tianren/customOp_autotune_fix -> origin/tianren/customOp_autotune_fix 2025-12-04T09:43:32.9278533Z * [new branch] tianren/customOp_enable_max_autotune -> origin/tianren/customOp_enable_max_autotune 2025-12-04T09:43:32.9280115Z * [new branch] tianren/customOp_fusion -> origin/tianren/customOp_fusion 2025-12-04T09:43:32.9281846Z * [new branch] tianren/customop_collectiveop_benchmark -> origin/tianren/customop_collectiveop_benchmark 2025-12-04T09:43:32.9283832Z * [new branch] tianren/customop_collectiveop_benchmark_fix -> origin/tianren/customop_collectiveop_benchmark_fix 2025-12-04T09:43:32.9285974Z * [new branch] 
tianren/customop_dynamic_config -> origin/tianren/customop_dynamic_config 2025-12-04T09:43:32.9287704Z * [new branch] tianren/dynamic_range_input -> origin/tianren/dynamic_range_input 2025-12-04T09:43:32.9289431Z * [new branch] tianren/dynamic_range_input_fix -> origin/tianren/dynamic_range_input_fix 2025-12-04T09:43:32.9291218Z * [new branch] tianren/dynamic_range_input_merge -> origin/tianren/dynamic_range_input_merge 2025-12-04T09:43:32.9292914Z * [new branch] tianren/flex_paged_attn_fix_temp -> origin/tianren/flex_paged_attn_fix_temp 2025-12-04T09:43:32.9294705Z * [new branch] tianren/fx_codegen_dump -> origin/tianren/fx_codegen_dump 2025-12-04T09:43:32.9296427Z * [new branch] tianren/symmetric_memory -> origin/tianren/symmetric_memory 2025-12-04T09:43:32.9298179Z * [new branch] tianren/test -> origin/tianren/test 2025-12-04T09:43:32.9300000Z * [new branch] tidy_performance_cyy -> origin/tidy_performance_cyy 2025-12-04T09:43:32.9301736Z * [new branch] tmp -> origin/tmp 2025-12-04T09:43:32.9303563Z * [new branch] torchtitan_ep -> origin/torchtitan_ep 2025-12-04T09:43:32.9305414Z * [new branch] torchtitan_integration -> origin/torchtitan_integration 2025-12-04T09:43:32.9307406Z * [new branch] trace_fsdp_torchtune_lora -> origin/trace_fsdp_torchtune_lora 2025-12-04T09:43:32.9309130Z * [new branch] traceable_fsdp_unit_tests -> origin/traceable_fsdp_unit_tests 2025-12-04T09:43:32.9310954Z * [new branch] tree_loop_vec_base -> origin/tree_loop_vec_base 2025-12-04T09:43:32.9312805Z * [new branch] triton_kernel -> origin/triton_kernel 2025-12-04T09:43:32.9314571Z * [new branch] tt_pkg_1908 -> origin/tt_pkg_1908 2025-12-04T09:43:32.9316425Z * [new branch] type_dec -> origin/type_dec 2025-12-04T09:43:32.9318257Z * [new branch] udate-sphinx-dependancies -> origin/udate-sphinx-dependancies 2025-12-04T09:43:32.9320689Z * [new branch] update-audio-commit-hash/17630256502-1803-1 -> origin/update-audio-commit-hash/17630256502-1803-1 2025-12-04T09:43:32.9322333Z * [new branch] 
update-audio-commit-hash/19087141161-1916-1 -> origin/update-audio-commit-hash/19087141161-1916-1 2025-12-04T09:43:32.9323970Z * [new branch] update-audio-commit-hash/19250643381-1929-1 -> origin/update-audio-commit-hash/19250643381-1929-1 2025-12-04T09:43:32.9325777Z * [new branch] update-audio-commit-hash/19397724337-1935-1 -> origin/update-audio-commit-hash/19397724337-1935-1 2025-12-04T09:43:32.9327378Z * [new branch] update-audio-commit-hash/19555670148-1941-1 -> origin/update-audio-commit-hash/19555670148-1941-1 2025-12-04T09:43:32.9329326Z * [new branch] update-audio-commit-hash/19750627930-1946-1 -> origin/update-audio-commit-hash/19750627930-1946-1 2025-12-04T09:43:32.9332221Z * [new branch] update-triton-commit-hash/13663274526-1487-2 -> origin/update-triton-commit-hash/13663274526-1487-2 2025-12-04T09:43:32.9334563Z * [new branch] update-vision-commit-hash/19087141161-1916-1 -> origin/update-vision-commit-hash/19087141161-1916-1 2025-12-04T09:43:32.9336201Z * [new branch] update-vision-commit-hash/19184897099-1925-1 -> origin/update-vision-commit-hash/19184897099-1925-1 2025-12-04T09:43:32.9337772Z * [new branch] update-vision-commit-hash/19250643381-1929-1 -> origin/update-vision-commit-hash/19250643381-1929-1 2025-12-04T09:43:32.9339413Z * [new branch] update-vision-commit-hash/19381328640-1934-1 -> origin/update-vision-commit-hash/19381328640-1934-1 2025-12-04T09:43:32.9341052Z * [new branch] update-vision-commit-hash/19485237164-1938-1 -> origin/update-vision-commit-hash/19485237164-1938-1 2025-12-04T09:43:32.9343451Z * [new branch] update-vllm-commit-hash/18451675449-1879-1 -> origin/update-vllm-commit-hash/18451675449-1879-1 2025-12-04T09:43:32.9345305Z * [new branch] update-vllm-dockerfile -> origin/update-vllm-dockerfile 2025-12-04T09:43:32.9347788Z * [new branch] update-xla-commit-hash/19224287370-211-1 -> origin/update-xla-commit-hash/19224287370-211-1 2025-12-04T09:43:32.9349634Z * [new branch] update-xla-commit-hash/19422028566-212-1 -> 
origin/update-xla-commit-hash/19422028566-212-1 2025-12-04T09:43:32.9351156Z * [new branch] update-xla-commit-hash/19626841311-213-1 -> origin/update-xla-commit-hash/19626841311-213-1 2025-12-04T09:43:32.9353004Z * [new branch] update_docs_torch_multinomial_issue#125388 -> origin/update_docs_torch_multinomial_issue#125388 2025-12-04T09:43:32.9354719Z * [new branch] update_operator_readme -> origin/update_operator_readme 2025-12-04T09:43:32.9358789Z * [new branch] update_slow_tests_1722488736 -> origin/update_slow_tests_1722488736 2025-12-04T09:43:32.9360544Z * [new branch] update_slow_tests_1722879173 -> origin/update_slow_tests_1722879173 2025-12-04T09:43:32.9362341Z * [new branch] update_slow_tests_1762155677 -> origin/update_slow_tests_1762155677 2025-12-04T09:43:32.9364307Z * [new branch] update_slow_tests_1763365283 -> origin/update_slow_tests_1763365283 2025-12-04T09:43:32.9366029Z * [new branch] update_submodule_FBGEMM -> origin/update_submodule_FBGEMM 2025-12-04T09:43:32.9367829Z * [new branch] update_submodule_kineto -> origin/update_submodule_kineto 2025-12-04T09:43:32.9369665Z * [new branch] update_submodule_tensorpipe -> origin/update_submodule_tensorpipe 2025-12-04T09:43:32.9371495Z * [new branch] upload-tests-for-autorevert -> origin/upload-tests-for-autorevert 2025-12-04T09:43:32.9373318Z * [new branch] v0.1.2 -> origin/v0.1.2 2025-12-04T09:43:32.9375168Z * [new branch] v1.0.1 -> origin/v1.0.1 2025-12-04T09:43:32.9377045Z * [new branch] v1.0.3 -> origin/v1.0.3 2025-12-04T09:43:32.9379167Z * [new branch] v1.1.0 -> origin/v1.1.0 2025-12-04T09:43:32.9381241Z * [new branch] v1.2.0 -> origin/v1.2.0 2025-12-04T09:43:32.9383092Z * [new branch] v1.3.0 -> origin/v1.3.0 2025-12-04T09:43:32.9384980Z * [new branch] v1.3.1 -> origin/v1.3.1 2025-12-04T09:43:32.9386825Z * [new branch] validate_fn -> origin/validate_fn 2025-12-04T09:43:32.9388839Z * [new branch] validations_2.6 -> origin/validations_2.6 2025-12-04T09:43:32.9390660Z * [new branch] validations_2.8 -> 
origin/validations_2.8 2025-12-04T09:43:32.9392436Z * [new branch] varlen-api -> origin/varlen-api 2025-12-04T09:43:32.9394275Z * [new branch] varlen-api-backup -> origin/varlen-api-backup 2025-12-04T09:43:32.9396020Z * [new branch] varlen_batch_invariance -> origin/varlen_batch_invariance 2025-12-04T09:43:32.9398363Z * [new branch] viable/strict -> origin/viable/strict 2025-12-04T09:43:32.9401248Z * [new branch] vishal9-team/dtensor_parallelism_toy -> origin/vishal9-team/dtensor_parallelism_toy 2025-12-04T09:43:32.9403183Z * [new branch] vllmbuildci -> origin/vllmbuildci 2025-12-04T09:43:32.9405051Z * [new branch] vllmpin -> origin/vllmpin 2025-12-04T09:43:32.9406988Z * [new branch] vscode-recommend-pyrefly -> origin/vscode-recommend-pyrefly 2025-12-04T09:43:32.9408963Z * [new branch] wdvr-patch-1 -> origin/wdvr-patch-1 2025-12-04T09:43:32.9411321Z * [new branch] wdvr/iss_145259 -> origin/wdvr/iss_145259 2025-12-04T09:43:32.9413991Z * [new branch] whc/pei -> origin/whc/pei 2025-12-04T09:43:32.9415731Z * [new branch] whc/pp_fix -> origin/whc/pp_fix 2025-12-04T09:43:32.9417453Z * [new branch] whc/sharding -> origin/whc/sharding 2025-12-04T09:43:32.9419090Z * [new branch] whc/sharding2 -> origin/whc/sharding2 2025-12-04T09:43:32.9420692Z * [new branch] whc/uneven -> origin/whc/uneven 2025-12-04T09:43:32.9422589Z * [new branch] whc/uneven-merge -> origin/whc/uneven-merge 2025-12-04T09:43:32.9424492Z * [new branch] win_warnings -> origin/win_warnings 2025-12-04T09:43:32.9426308Z * [new branch] windows_libtorch_free -> origin/windows_libtorch_free 2025-12-04T09:43:32.9428212Z * [new branch] xmfan-war -> origin/xmfan-war 2025-12-04T09:43:32.9430502Z * [new branch] xmfan/ca_0516 -> origin/xmfan/ca_0516 2025-12-04T09:43:32.9432271Z * [new branch] xmfan/ca_1051b93192 -> origin/xmfan/ca_1051b93192 2025-12-04T09:43:32.9434030Z * [new branch] xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 -> origin/xmfan/ca_1a722f62c248391fc4a542e8851a5559aa356ae8 
2025-12-04T09:43:32.9435276Z * [new branch] xmfan/ca_5a2be192d1 -> origin/xmfan/ca_5a2be192d1 2025-12-04T09:43:32.9437104Z * [new branch] xmfan/ca_9d59b516e9 -> origin/xmfan/ca_9d59b516e9 2025-12-04T09:43:32.9438979Z * [new branch] xmfan/ca_apr8 -> origin/xmfan/ca_apr8 2025-12-04T09:43:32.9440359Z * [new branch] xmfan/ca_base -> origin/xmfan/ca_base 2025-12-04T09:43:32.9442531Z * [new branch] xmfan/ca_dynamic -> origin/xmfan/ca_dynamic 2025-12-04T09:43:32.9444437Z * [new branch] xmfan/ca_fix_dyn -> origin/xmfan/ca_fix_dyn 2025-12-04T09:43:32.9446712Z * [new branch] xmfan/ca_fix_lowering -> origin/xmfan/ca_fix_lowering 2025-12-04T09:43:32.9448438Z * [new branch] xmfan/ca_fix_polyfills -> origin/xmfan/ca_fix_polyfills 2025-12-04T09:43:32.9450004Z * [new branch] xmfan/ca_jan3 -> origin/xmfan/ca_jan3 2025-12-04T09:43:32.9451734Z * [new branch] xmfan/ca_jun18 -> origin/xmfan/ca_jun18 2025-12-04T09:43:32.9453553Z * [new branch] xmfan/ca_jun24 -> origin/xmfan/ca_jun24 2025-12-04T09:43:32.9455426Z * [new branch] xmfan/ca_nested -> origin/xmfan/ca_nested 2025-12-04T09:43:32.9457207Z * [new branch] xmfan/ca_overhead -> origin/xmfan/ca_overhead 2025-12-04T09:43:32.9458951Z * [new branch] xmfan/ca_overhead_0eba7e5451 -> origin/xmfan/ca_overhead_0eba7e5451 2025-12-04T09:43:32.9460694Z * [new branch] xmfan/cacu_jun18 -> origin/xmfan/cacu_jun18 2025-12-04T09:43:32.9462418Z * [new branch] xmfan/cacu_jun19 -> origin/xmfan/cacu_jun19 2025-12-04T09:43:32.9464190Z * [new branch] xmfan/cacu_jun4 -> origin/xmfan/cacu_jun4 2025-12-04T09:43:32.9465924Z * [new branch] xmfan/disable_duck_shape -> origin/xmfan/disable_duck_shape 2025-12-04T09:43:32.9467760Z * [new branch] xmfan/fca_cpp_node_passthrough -> origin/xmfan/fca_cpp_node_passthrough 2025-12-04T09:43:32.9469642Z * [new branch] xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 -> origin/xmfan/post_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T09:43:32.9471552Z * [new branch] xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 
-> origin/xmfan/pre_3945954741e2d37023c5d6954f9483008e0892f9 2025-12-04T09:43:32.9473056Z * [new branch] xmfan/single_step -> origin/xmfan/single_step 2025-12-04T09:43:32.9474912Z * [new branch] xmfan/sth_0829 -> origin/xmfan/sth_0829 2025-12-04T09:43:32.9476712Z * [new branch] xmfan/test -> origin/xmfan/test 2025-12-04T09:43:32.9479113Z * [new branch] yguo/debug-0226-constexpr -> origin/yguo/debug-0226-constexpr 2025-12-04T09:43:32.9480775Z * [new branch] yguo/new_latest_changes -> origin/yguo/new_latest_changes 2025-12-04T09:43:32.9482440Z * [new branch] yguo/patch_constexpr_changes -> origin/yguo/patch_constexpr_changes 2025-12-04T09:43:32.9484600Z * [new branch] yiming/bootcamp -> origin/yiming/bootcamp 2025-12-04T09:43:32.9486329Z * [new branch] yiming/run_with_start_end_rng_hop -> origin/yiming/run_with_start_end_rng_hop 2025-12-04T09:43:32.9488149Z * [new branch] yolo-llama3 -> origin/yolo-llama3 2025-12-04T09:43:32.9490569Z * [new branch] zainr/canary-test -> origin/zainr/canary-test 2025-12-04T09:43:32.9492361Z * [new branch] zainr/cleanup-gh-runners -> origin/zainr/cleanup-gh-runners 2025-12-04T09:43:32.9493979Z * [new branch] zainr/pull-migration-c -> origin/zainr/pull-migration-c 2025-12-04T09:43:32.9495538Z * [new branch] zainr/test2 -> origin/zainr/test2 2025-12-04T09:43:32.9497557Z * [new branch] zasdfgbnm-patch-3 -> origin/zasdfgbnm-patch-3 2025-12-04T09:43:32.9499234Z * [new branch] zb2p -> origin/zb2p 2025-12-04T09:43:32.9501006Z * [new branch] zeros-and-scatter-part2 -> origin/zeros-and-scatter-part2 2025-12-04T09:43:32.9503856Z * [new branch] zhxchen17/ci/vllm_lora_oom -> origin/zhxchen17/ci/vllm_lora_oom 2025-12-04T09:43:32.9506074Z * [new branch] zhxchen17/ci/vllm_multimodal_oom -> origin/zhxchen17/ci/vllm_multimodal_oom 2025-12-04T09:43:32.9507787Z * [new branch] zhxchen17/ci/vllm_pin -> origin/zhxchen17/ci/vllm_pin 2025-12-04T09:43:32.9510205Z * [new branch] zhxchen17/dynamo/unsafe_drop_all_guards -> 
origin/zhxchen17/dynamo/unsafe_drop_all_guards 2025-12-04T09:43:32.9512572Z * [new branch] zhxchen17/export/call_override -> origin/zhxchen17/export/call_override 2025-12-04T09:43:32.9514235Z * [new branch] zhxchen17/export/codemod1 -> origin/zhxchen17/export/codemod1 2025-12-04T09:43:32.9515930Z * [new branch] zhxchen17/export/ctx_return -> origin/zhxchen17/export/ctx_return 2025-12-04T09:43:32.9517734Z * [new branch] zhxchen17/export/disable_side_effect_warn -> origin/zhxchen17/export/disable_side_effect_warn 2025-12-04T09:43:32.9519197Z * [new branch] zhxchen17/export/pytree_check -> origin/zhxchen17/export/pytree_check 2025-12-04T09:43:32.9521584Z * [new branch] zhxchen17/precompile/aoti -> origin/zhxchen17/precompile/aoti 2025-12-04T09:43:32.9523338Z * [new branch] zhxchen17/precompile/globals -> origin/zhxchen17/precompile/globals 2025-12-04T09:43:32.9525040Z * [new branch] zhxchen17/precompile/inductor_guards -> origin/zhxchen17/precompile/inductor_guards 2025-12-04T09:43:32.9527190Z * [new branch] zhxchen17/scratch/0 -> origin/zhxchen17/scratch/0 2025-12-04T09:43:32.9529012Z * [new branch] zhxchen17/torch_export_api_update -> origin/zhxchen17/torch_export_api_update 2025-12-04T09:43:32.9531255Z * [new branch] zhxhcen17/moodycamel -> origin/zhxhcen17/moodycamel 2025-12-04T09:43:32.9533636Z * [new branch] zxiiro/build-times -> origin/zxiiro/build-times 2025-12-04T09:43:32.9535407Z * [new branch] zxiiro/c7i.2xlarge -> origin/zxiiro/c7i.2xlarge 2025-12-04T09:43:32.9537209Z * [new branch] zxiiro/c7i.2xlarge.h100 -> origin/zxiiro/c7i.2xlarge.h100 2025-12-04T09:43:32.9538905Z * [new branch] zxiiro/main -> origin/zxiiro/main 2025-12-04T09:43:32.9540760Z * [new branch] zxiiro/risc64 -> origin/zxiiro/risc64 2025-12-04T09:43:32.9542586Z * [new branch] zxiiro/test-multicloud-arc -> origin/zxiiro/test-multicloud-arc 2025-12-04T09:43:32.9543911Z * [new tag] bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug -> bc2caa7fdf006894eff7af936babde69ab5a40f8-huydhn-debug 
2025-12-04T09:43:32.9545596Z * [new tag] ci/binaries/77164 -> ci/binaries/77164 2025-12-04T09:43:32.9546856Z * [new tag] ciflow/b200/115316 -> ciflow/b200/115316 2025-12-04T09:43:32.9548177Z * [new tag] ciflow/b200/160685 -> ciflow/b200/160685 2025-12-04T09:43:32.9549311Z * [new tag] ciflow/b200/161607 -> ciflow/b200/161607 2025-12-04T09:43:32.9550452Z * [new tag] ciflow/b200/161938 -> ciflow/b200/161938 2025-12-04T09:43:32.9551837Z * [new tag] ciflow/b200/167207 -> ciflow/b200/167207 2025-12-04T09:43:32.9553006Z * [new tag] ciflow/b200/167989 -> ciflow/b200/167989 2025-12-04T09:43:32.9554273Z * [new tag] ciflow/b200/168096 -> ciflow/b200/168096 2025-12-04T09:43:32.9555584Z * [new tag] ciflow/b200/168175 -> ciflow/b200/168175 2025-12-04T09:43:32.9557546Z * [new tag] ciflow/b200/168195 -> ciflow/b200/168195 2025-12-04T09:43:32.9558615Z * [new tag] ciflow/b200/169200 -> ciflow/b200/169200 2025-12-04T09:43:32.9560010Z * [new tag] ciflow/b200/169216 -> ciflow/b200/169216 2025-12-04T09:43:32.9561655Z * [new tag] ciflow/b200/169380 -> ciflow/b200/169380 2025-12-04T09:43:32.9563365Z * [new tag] ciflow/b200/169412 -> ciflow/b200/169412 2025-12-04T09:43:32.9564850Z * [new tag] ciflow/b200/169470 -> ciflow/b200/169470 2025-12-04T09:43:32.9566184Z * [new tag] ciflow/b200/169471 -> ciflow/b200/169471 2025-12-04T09:43:32.9567450Z * [new tag] ciflow/b200/169472 -> ciflow/b200/169472 2025-12-04T09:43:32.9568792Z * [new tag] ciflow/b200/169514 -> ciflow/b200/169514 2025-12-04T09:43:32.9570031Z * [new tag] ciflow/b200/169517 -> ciflow/b200/169517 2025-12-04T09:43:32.9571570Z * [new tag] ciflow/binaries/165922 -> ciflow/binaries/165922 2025-12-04T09:43:32.9572687Z * [new tag] ciflow/binaries/169510 -> ciflow/binaries/169510 2025-12-04T09:43:32.9574266Z * [new tag] ciflow/binaries_wheel/157994 -> ciflow/binaries_wheel/157994 2025-12-04T09:43:32.9575492Z * [new tag] ciflow/binaries_wheel/166829 -> ciflow/binaries_wheel/166829 2025-12-04T09:43:32.9576618Z * [new tag] 
ciflow/binaries_wheel/167972 -> ciflow/binaries_wheel/167972 2025-12-04T09:43:32.9578091Z * [new tag] ciflow/binaries_wheel/167981 -> ciflow/binaries_wheel/167981 2025-12-04T09:43:32.9579437Z * [new tag] ciflow/dynamo/167695 -> ciflow/dynamo/167695 2025-12-04T09:43:32.9580682Z * [new tag] ciflow/dynamo/168096 -> ciflow/dynamo/168096 2025-12-04T09:43:32.9581923Z * [new tag] ciflow/dynamo/169525 -> ciflow/dynamo/169525 2025-12-04T09:43:32.9583380Z * [new tag] ciflow/h100-cutlass-backend/161938 -> ciflow/h100-cutlass-backend/161938 2025-12-04T09:43:32.9584457Z * [new tag] ciflow/h100-cutlass-backend/161940 -> ciflow/h100-cutlass-backend/161940 2025-12-04T09:43:32.9586057Z * [new tag] ciflow/h100-distributed/168923 -> ciflow/h100-distributed/168923 2025-12-04T09:43:32.9587459Z * [new tag] ciflow/h100-symm-mem/167552 -> ciflow/h100-symm-mem/167552 2025-12-04T09:43:32.9588568Z * [new tag] ciflow/h100-symm-mem/168129 -> ciflow/h100-symm-mem/168129 2025-12-04T09:43:32.9589794Z * [new tag] ciflow/h100-symm-mem/168917 -> ciflow/h100-symm-mem/168917 2025-12-04T09:43:32.9591175Z * [new tag] ciflow/h100-symm-mem/169156 -> ciflow/h100-symm-mem/169156 2025-12-04T09:43:32.9592270Z * [new tag] ciflow/h100-symm-mem/169200 -> ciflow/h100-symm-mem/169200 2025-12-04T09:43:32.9593426Z * [new tag] ciflow/h100-symm-mem/169216 -> ciflow/h100-symm-mem/169216 2025-12-04T09:43:32.9594489Z * [new tag] ciflow/h100-symm-mem/169338 -> ciflow/h100-symm-mem/169338 2025-12-04T09:43:32.9596294Z * [new tag] ciflow/h100-symm-mem/169355 -> ciflow/h100-symm-mem/169355 2025-12-04T09:43:32.9597501Z * [new tag] ciflow/h100-symm-mem/169543 -> ciflow/h100-symm-mem/169543 2025-12-04T09:43:32.9598908Z * [new tag] ciflow/h100/115316 -> ciflow/h100/115316 2025-12-04T09:43:32.9600144Z * [new tag] ciflow/h100/160685 -> ciflow/h100/160685 2025-12-04T09:43:32.9601131Z * [new tag] ciflow/h100/160729 -> ciflow/h100/160729 2025-12-04T09:43:32.9602401Z * [new tag] ciflow/h100/161607 -> ciflow/h100/161607 
2025-12-04T09:43:32.9603397Z * [new tag] ciflow/h100/161938 -> ciflow/h100/161938 2025-12-04T09:43:32.9604823Z * [new tag] ciflow/h100/167207 -> ciflow/h100/167207 2025-12-04T09:43:32.9605701Z * [new tag] ciflow/h100/167989 -> ciflow/h100/167989 2025-12-04T09:43:32.9606989Z * [new tag] ciflow/h100/168096 -> ciflow/h100/168096 2025-12-04T09:43:32.9607935Z * [new tag] ciflow/h100/168175 -> ciflow/h100/168175 2025-12-04T09:43:32.9609233Z * [new tag] ciflow/h100/168195 -> ciflow/h100/168195 2025-12-04T09:43:32.9610304Z * [new tag] ciflow/h100/168980 -> ciflow/h100/168980 2025-12-04T09:43:32.9611839Z * [new tag] ciflow/h100/169200 -> ciflow/h100/169200 2025-12-04T09:43:32.9613350Z * [new tag] ciflow/h100/169216 -> ciflow/h100/169216 2025-12-04T09:43:32.9614791Z * [new tag] ciflow/h100/169380 -> ciflow/h100/169380 2025-12-04T09:43:32.9616054Z * [new tag] ciflow/h100/169412 -> ciflow/h100/169412 2025-12-04T09:43:32.9617211Z * [new tag] ciflow/h100/169470 -> ciflow/h100/169470 2025-12-04T09:43:32.9618464Z * [new tag] ciflow/h100/169471 -> ciflow/h100/169471 2025-12-04T09:43:32.9619755Z * [new tag] ciflow/h100/169472 -> ciflow/h100/169472 2025-12-04T09:43:32.9620800Z * [new tag] ciflow/h100/169514 -> ciflow/h100/169514 2025-12-04T09:43:32.9622325Z * [new tag] ciflow/inductor-cu126/168096 -> ciflow/inductor-cu126/168096 2025-12-04T09:43:32.9624134Z * [new tag] ciflow/inductor-micro-benchmark-cpu-x86/168096 -> ciflow/inductor-micro-benchmark-cpu-x86/168096 2025-12-04T09:43:32.9625389Z * [new tag] ciflow/inductor-micro-benchmark/166165 -> ciflow/inductor-micro-benchmark/166165 2025-12-04T09:43:32.9626609Z * [new tag] ciflow/inductor-micro-benchmark/168096 -> ciflow/inductor-micro-benchmark/168096 2025-12-04T09:43:32.9628220Z * [new tag] ciflow/inductor-perf-compare/168096 -> ciflow/inductor-perf-compare/168096 2025-12-04T09:43:32.9629949Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168073 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168073 
2025-12-04T09:43:32.9631043Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/168096 -> ciflow/inductor-perf-test-nightly-rocm-mi300/168096 2025-12-04T09:43:32.9632277Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi300/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi300/169024 2025-12-04T09:43:32.9633919Z * [new tag] ciflow/inductor-perf-test-nightly-rocm-mi355/169024 -> ciflow/inductor-perf-test-nightly-rocm-mi355/169024 2025-12-04T09:43:32.9635137Z * [new tag] ciflow/inductor-perf-test-nightly/168096 -> ciflow/inductor-perf-test-nightly/168096 2025-12-04T09:43:32.9636637Z * [new tag] ciflow/inductor-periodic/168096 -> ciflow/inductor-periodic/168096 2025-12-04T09:43:32.9637692Z * [new tag] ciflow/inductor-periodic/169024 -> ciflow/inductor-periodic/169024 2025-12-04T09:43:32.9639079Z * [new tag] ciflow/inductor-periodic/169425 -> ciflow/inductor-periodic/169425 2025-12-04T09:43:32.9640615Z * [new tag] ciflow/inductor-rocm-mi200/165545 -> ciflow/inductor-rocm-mi200/165545 2025-12-04T09:43:32.9641796Z * [new tag] ciflow/inductor-rocm-mi200/165997 -> ciflow/inductor-rocm-mi200/165997 2025-12-04T09:43:32.9642943Z * [new tag] ciflow/inductor-rocm-mi200/168096 -> ciflow/inductor-rocm-mi200/168096 2025-12-04T09:43:32.9644346Z * [new tag] ciflow/inductor-rocm-mi200/169063 -> ciflow/inductor-rocm-mi200/169063 2025-12-04T09:43:32.9645380Z * [new tag] ciflow/inductor-rocm-mi200/169425 -> ciflow/inductor-rocm-mi200/169425 2025-12-04T09:43:32.9646978Z * [new tag] ciflow/inductor-rocm-mi300/165545 -> ciflow/inductor-rocm-mi300/165545 2025-12-04T09:43:32.9647962Z * [new tag] ciflow/inductor-rocm-mi300/168096 -> ciflow/inductor-rocm-mi300/168096 2025-12-04T09:43:32.9649068Z * [new tag] ciflow/inductor-rocm-mi300/169063 -> ciflow/inductor-rocm-mi300/169063 2025-12-04T09:43:32.9650224Z * [new tag] ciflow/inductor-rocm-mi300/169425 -> ciflow/inductor-rocm-mi300/169425 2025-12-04T09:43:32.9651905Z * [new tag] ciflow/inductor-rocm/162052 -> 
ciflow/inductor-rocm/162052 2025-12-04T09:43:32.9652992Z * [new tag] ciflow/inductor-rocm/168971 -> ciflow/inductor-rocm/168971 2025-12-04T09:43:32.9654523Z * [new tag] ciflow/inductor-windows/168096 -> ciflow/inductor-windows/168096 2025-12-04T09:43:32.9656113Z * [new tag] ciflow/inductor/144542 -> ciflow/inductor/144542 2025-12-04T09:43:32.9657302Z * [new tag] ciflow/inductor/146506 -> ciflow/inductor/146506 2025-12-04T09:43:32.9658413Z * [new tag] ciflow/inductor/147990 -> ciflow/inductor/147990 2025-12-04T09:43:32.9671322Z * [new tag] ciflow/inductor/148294 -> ciflow/inductor/148294 2025-12-04T09:43:32.9671772Z * [new tag] ciflow/inductor/148492 -> ciflow/inductor/148492 2025-12-04T09:43:32.9672125Z * [new tag] ciflow/inductor/157149 -> ciflow/inductor/157149 2025-12-04T09:43:32.9672447Z * [new tag] ciflow/inductor/157994 -> ciflow/inductor/157994 2025-12-04T09:43:32.9672805Z * [new tag] ciflow/inductor/160685 -> ciflow/inductor/160685 2025-12-04T09:43:32.9673137Z * [new tag] ciflow/inductor/160686 -> ciflow/inductor/160686 2025-12-04T09:43:32.9673459Z * [new tag] ciflow/inductor/160687 -> ciflow/inductor/160687 2025-12-04T09:43:32.9673774Z * [new tag] ciflow/inductor/160688 -> ciflow/inductor/160688 2025-12-04T09:43:32.9674098Z * [new tag] ciflow/inductor/160706 -> ciflow/inductor/160706 2025-12-04T09:43:32.9674421Z * [new tag] ciflow/inductor/160729 -> ciflow/inductor/160729 2025-12-04T09:43:32.9674743Z * [new tag] ciflow/inductor/161938 -> ciflow/inductor/161938 2025-12-04T09:43:32.9675067Z * [new tag] ciflow/inductor/161939 -> ciflow/inductor/161939 2025-12-04T09:43:32.9675640Z * [new tag] ciflow/inductor/161940 -> ciflow/inductor/161940 2025-12-04T09:43:32.9676758Z * [new tag] ciflow/inductor/162052 -> ciflow/inductor/162052 2025-12-04T09:43:32.9677885Z * [new tag] ciflow/inductor/162275 -> ciflow/inductor/162275 2025-12-04T09:43:32.9679103Z * [new tag] ciflow/inductor/162795 -> ciflow/inductor/162795 2025-12-04T09:43:32.9680509Z * [new tag] 
ciflow/inductor/163245 -> ciflow/inductor/163245 2025-12-04T09:43:32.9681717Z * [new tag] ciflow/inductor/163335 -> ciflow/inductor/163335 2025-12-04T09:43:32.9682925Z * [new tag] ciflow/inductor/163503 -> ciflow/inductor/163503 2025-12-04T09:43:32.9684176Z * [new tag] ciflow/inductor/163942 -> ciflow/inductor/163942 2025-12-04T09:43:32.9685715Z * [new tag] ciflow/inductor/165270 -> ciflow/inductor/165270 2025-12-04T09:43:32.9686874Z * [new tag] ciflow/inductor/165274 -> ciflow/inductor/165274 2025-12-04T09:43:32.9688070Z * [new tag] ciflow/inductor/165322 -> ciflow/inductor/165322 2025-12-04T09:43:32.9689275Z * [new tag] ciflow/inductor/165597 -> ciflow/inductor/165597 2025-12-04T09:43:32.9690497Z * [new tag] ciflow/inductor/166063 -> ciflow/inductor/166063 2025-12-04T09:43:32.9691839Z * [new tag] ciflow/inductor/166075 -> ciflow/inductor/166075 2025-12-04T09:43:32.9693138Z * [new tag] ciflow/inductor/166165 -> ciflow/inductor/166165 2025-12-04T09:43:32.9694563Z * [new tag] ciflow/inductor/166254 -> ciflow/inductor/166254 2025-12-04T09:43:32.9695680Z * [new tag] ciflow/inductor/166483 -> ciflow/inductor/166483 2025-12-04T09:43:32.9696846Z * [new tag] ciflow/inductor/166494 -> ciflow/inductor/166494 2025-12-04T09:43:32.9698075Z * [new tag] ciflow/inductor/166545 -> ciflow/inductor/166545 2025-12-04T09:43:32.9699351Z * [new tag] ciflow/inductor/166788 -> ciflow/inductor/166788 2025-12-04T09:43:32.9700921Z * [new tag] ciflow/inductor/166846 -> ciflow/inductor/166846 2025-12-04T09:43:32.9702012Z * [new tag] ciflow/inductor/167300 -> ciflow/inductor/167300 2025-12-04T09:43:32.9703231Z * [new tag] ciflow/inductor/167407 -> ciflow/inductor/167407 2025-12-04T09:43:32.9704771Z * [new tag] ciflow/inductor/167536 -> ciflow/inductor/167536 2025-12-04T09:43:32.9705986Z * [new tag] ciflow/inductor/167552 -> ciflow/inductor/167552 2025-12-04T09:43:32.9707189Z * [new tag] ciflow/inductor/167555 -> ciflow/inductor/167555 2025-12-04T09:43:32.9708808Z * [new tag] 
ciflow/inductor/167583 -> ciflow/inductor/167583 2025-12-04T09:43:32.9709884Z * [new tag] ciflow/inductor/167599 -> ciflow/inductor/167599 2025-12-04T09:43:32.9711137Z * [new tag] ciflow/inductor/167647 -> ciflow/inductor/167647 2025-12-04T09:43:32.9712352Z * [new tag] ciflow/inductor/167677 -> ciflow/inductor/167677 2025-12-04T09:43:32.9713590Z * [new tag] ciflow/inductor/167680 -> ciflow/inductor/167680 2025-12-04T09:43:32.9714829Z * [new tag] ciflow/inductor/167695 -> ciflow/inductor/167695 2025-12-04T09:43:32.9716057Z * [new tag] ciflow/inductor/167742 -> ciflow/inductor/167742 2025-12-04T09:43:32.9717273Z * [new tag] ciflow/inductor/167768 -> ciflow/inductor/167768 2025-12-04T09:43:32.9718842Z * [new tag] ciflow/inductor/167773 -> ciflow/inductor/167773 2025-12-04T09:43:32.9719991Z * [new tag] ciflow/inductor/167781 -> ciflow/inductor/167781 2025-12-04T09:43:32.9721213Z * [new tag] ciflow/inductor/167880 -> ciflow/inductor/167880 2025-12-04T09:43:32.9722411Z * [new tag] ciflow/inductor/167887 -> ciflow/inductor/167887 2025-12-04T09:43:32.9723813Z * [new tag] ciflow/inductor/167972 -> ciflow/inductor/167972 2025-12-04T09:43:32.9724948Z * [new tag] ciflow/inductor/167989 -> ciflow/inductor/167989 2025-12-04T09:43:32.9726178Z * [new tag] ciflow/inductor/168002 -> ciflow/inductor/168002 2025-12-04T09:43:32.9727408Z * [new tag] ciflow/inductor/168050 -> ciflow/inductor/168050 2025-12-04T09:43:32.9728646Z * [new tag] ciflow/inductor/168051 -> ciflow/inductor/168051 2025-12-04T09:43:32.9729869Z * [new tag] ciflow/inductor/168052 -> ciflow/inductor/168052 2025-12-04T09:43:32.9731217Z * [new tag] ciflow/inductor/168073 -> ciflow/inductor/168073 2025-12-04T09:43:32.9732291Z * [new tag] ciflow/inductor/168096 -> ciflow/inductor/168096 2025-12-04T09:43:32.9734080Z * [new tag] ciflow/inductor/168114 -> ciflow/inductor/168114 2025-12-04T09:43:32.9735330Z * [new tag] ciflow/inductor/168115 -> ciflow/inductor/168115 2025-12-04T09:43:32.9736523Z * [new tag] 
ciflow/inductor/168127 -> ciflow/inductor/168127 2025-12-04T09:43:32.9737810Z * [new tag] ciflow/inductor/168129 -> ciflow/inductor/168129 2025-12-04T09:43:32.9739043Z * [new tag] ciflow/inductor/168157 -> ciflow/inductor/168157 2025-12-04T09:43:32.9740337Z * [new tag] ciflow/inductor/168175 -> ciflow/inductor/168175 2025-12-04T09:43:32.9741379Z * [new tag] ciflow/inductor/168185 -> ciflow/inductor/168185 2025-12-04T09:43:32.9742711Z * [new tag] ciflow/inductor/168195 -> ciflow/inductor/168195 2025-12-04T09:43:32.9743955Z * [new tag] ciflow/inductor/168209 -> ciflow/inductor/168209 2025-12-04T09:43:32.9745131Z * [new tag] ciflow/inductor/168266 -> ciflow/inductor/168266 2025-12-04T09:43:32.9746515Z * [new tag] ciflow/inductor/168316 -> ciflow/inductor/168316 2025-12-04T09:43:32.9747932Z * [new tag] ciflow/inductor/168326 -> ciflow/inductor/168326 2025-12-04T09:43:32.9749238Z * [new tag] ciflow/inductor/168368 -> ciflow/inductor/168368 2025-12-04T09:43:32.9750419Z * [new tag] ciflow/inductor/168894 -> ciflow/inductor/168894 2025-12-04T09:43:32.9751701Z * [new tag] ciflow/inductor/168934 -> ciflow/inductor/168934 2025-12-04T09:43:32.9752954Z * [new tag] ciflow/inductor/168939 -> ciflow/inductor/168939 2025-12-04T09:43:32.9754020Z * [new tag] ciflow/inductor/168946 -> ciflow/inductor/168946 2025-12-04T09:43:32.9755523Z * [new tag] ciflow/inductor/168950 -> ciflow/inductor/168950 2025-12-04T09:43:32.9758531Z * [new tag] ciflow/inductor/168951 -> ciflow/inductor/168951 2025-12-04T09:43:32.9759795Z * [new tag] ciflow/inductor/168952 -> ciflow/inductor/168952 2025-12-04T09:43:32.9761037Z * [new tag] ciflow/inductor/168955 -> ciflow/inductor/168955 2025-12-04T09:43:32.9762274Z * [new tag] ciflow/inductor/168971 -> ciflow/inductor/168971 2025-12-04T09:43:32.9763510Z * [new tag] ciflow/inductor/168979 -> ciflow/inductor/168979 2025-12-04T09:43:32.9764737Z * [new tag] ciflow/inductor/168980 -> ciflow/inductor/168980 2025-12-04T09:43:32.9766042Z * [new tag] 
ciflow/inductor/168983 -> ciflow/inductor/168983 2025-12-04T09:43:32.9767317Z * [new tag] ciflow/inductor/169006 -> ciflow/inductor/169006 2025-12-04T09:43:32.9768668Z * [new tag] ciflow/inductor/169023 -> ciflow/inductor/169023 2025-12-04T09:43:32.9769852Z * [new tag] ciflow/inductor/169024 -> ciflow/inductor/169024 2025-12-04T09:43:32.9771134Z * [new tag] ciflow/inductor/169025 -> ciflow/inductor/169025 2025-12-04T09:43:32.9772409Z * [new tag] ciflow/inductor/169066 -> ciflow/inductor/169066 2025-12-04T09:43:32.9773682Z * [new tag] ciflow/inductor/169091 -> ciflow/inductor/169091 2025-12-04T09:43:32.9774920Z * [new tag] ciflow/inductor/169102 -> ciflow/inductor/169102 2025-12-04T09:43:32.9776135Z * [new tag] ciflow/inductor/169103 -> ciflow/inductor/169103 2025-12-04T09:43:32.9777443Z * [new tag] ciflow/inductor/169121 -> ciflow/inductor/169121 2025-12-04T09:43:32.9778691Z * [new tag] ciflow/inductor/169134 -> ciflow/inductor/169134 2025-12-04T09:43:32.9779927Z * [new tag] ciflow/inductor/169135 -> ciflow/inductor/169135 2025-12-04T09:43:32.9781144Z * [new tag] ciflow/inductor/169141 -> ciflow/inductor/169141 2025-12-04T09:43:32.9782393Z * [new tag] ciflow/inductor/169151 -> ciflow/inductor/169151 2025-12-04T09:43:32.9783703Z * [new tag] ciflow/inductor/169161 -> ciflow/inductor/169161 2025-12-04T09:43:32.9784946Z * [new tag] ciflow/inductor/169167 -> ciflow/inductor/169167 2025-12-04T09:43:32.9786321Z * [new tag] ciflow/inductor/169177 -> ciflow/inductor/169177 2025-12-04T09:43:32.9787894Z * [new tag] ciflow/inductor/169185 -> ciflow/inductor/169185 2025-12-04T09:43:32.9789045Z * [new tag] ciflow/inductor/169196 -> ciflow/inductor/169196 2025-12-04T09:43:32.9790348Z * [new tag] ciflow/inductor/169200 -> ciflow/inductor/169200 2025-12-04T09:43:32.9791632Z * [new tag] ciflow/inductor/169204 -> ciflow/inductor/169204 2025-12-04T09:43:32.9792825Z * [new tag] ciflow/inductor/169216 -> ciflow/inductor/169216 2025-12-04T09:43:32.9794009Z * [new tag] 
ciflow/inductor/169219 -> ciflow/inductor/169219 2025-12-04T09:43:32.9795307Z * [new tag] ciflow/inductor/169220 -> ciflow/inductor/169220 2025-12-04T09:43:32.9796667Z * [new tag] ciflow/inductor/169230 -> ciflow/inductor/169230 2025-12-04T09:43:32.9797945Z * [new tag] ciflow/inductor/169242 -> ciflow/inductor/169242 2025-12-04T09:43:32.9799236Z * [new tag] ciflow/inductor/169245 -> ciflow/inductor/169245 2025-12-04T09:43:32.9800590Z * [new tag] ciflow/inductor/169260 -> ciflow/inductor/169260 2025-12-04T09:43:32.9801844Z * [new tag] ciflow/inductor/169282 -> ciflow/inductor/169282 2025-12-04T09:43:32.9803317Z * [new tag] ciflow/inductor/169286 -> ciflow/inductor/169286 2025-12-04T09:43:32.9804590Z * [new tag] ciflow/inductor/169299 -> ciflow/inductor/169299 2025-12-04T09:43:32.9805954Z * [new tag] ciflow/inductor/169304 -> ciflow/inductor/169304 2025-12-04T09:43:32.9807541Z * [new tag] ciflow/inductor/169305 -> ciflow/inductor/169305 2025-12-04T09:43:32.9808791Z * [new tag] ciflow/inductor/169308 -> ciflow/inductor/169308 2025-12-04T09:43:32.9810059Z * [new tag] ciflow/inductor/169319 -> ciflow/inductor/169319 2025-12-04T09:43:32.9811290Z * [new tag] ciflow/inductor/169326 -> ciflow/inductor/169326 2025-12-04T09:43:32.9812441Z * [new tag] ciflow/inductor/169332 -> ciflow/inductor/169332 2025-12-04T09:43:32.9813719Z * [new tag] ciflow/inductor/169333 -> ciflow/inductor/169333 2025-12-04T09:43:32.9815147Z * [new tag] ciflow/inductor/169336 -> ciflow/inductor/169336 2025-12-04T09:43:32.9816418Z * [new tag] ciflow/inductor/169340 -> ciflow/inductor/169340 2025-12-04T09:43:32.9818138Z * [new tag] ciflow/inductor/169341 -> ciflow/inductor/169341 2025-12-04T09:43:32.9819429Z * [new tag] ciflow/inductor/169343 -> ciflow/inductor/169343 2025-12-04T09:43:32.9820695Z * [new tag] ciflow/inductor/169346 -> ciflow/inductor/169346 2025-12-04T09:43:32.9821968Z * [new tag] ciflow/inductor/169348 -> ciflow/inductor/169348 2025-12-04T09:43:32.9823320Z * [new tag] 
ciflow/inductor/169350 -> ciflow/inductor/169350 2025-12-04T09:43:32.9824583Z * [new tag] ciflow/inductor/169355 -> ciflow/inductor/169355 2025-12-04T09:43:32.9825830Z * [new tag] ciflow/inductor/169370 -> ciflow/inductor/169370 2025-12-04T09:43:32.9827417Z * [new tag] ciflow/inductor/169375 -> ciflow/inductor/169375 2025-12-04T09:43:32.9828728Z * [new tag] ciflow/inductor/169389 -> ciflow/inductor/169389 2025-12-04T09:43:32.9829987Z * [new tag] ciflow/inductor/169391 -> ciflow/inductor/169391 2025-12-04T09:43:32.9831285Z * [new tag] ciflow/inductor/169393 -> ciflow/inductor/169393 2025-12-04T09:43:32.9832525Z * [new tag] ciflow/inductor/169399 -> ciflow/inductor/169399 2025-12-04T09:43:32.9833897Z * [new tag] ciflow/inductor/169400 -> ciflow/inductor/169400 2025-12-04T09:43:32.9835209Z * [new tag] ciflow/inductor/169415 -> ciflow/inductor/169415 2025-12-04T09:43:32.9836567Z * [new tag] ciflow/inductor/169417 -> ciflow/inductor/169417 2025-12-04T09:43:32.9837622Z * [new tag] ciflow/inductor/169418 -> ciflow/inductor/169418 2025-12-04T09:43:32.9839131Z * [new tag] ciflow/inductor/169430 -> ciflow/inductor/169430 2025-12-04T09:43:32.9840428Z * [new tag] ciflow/inductor/169432 -> ciflow/inductor/169432 2025-12-04T09:43:32.9841661Z * [new tag] ciflow/inductor/169436 -> ciflow/inductor/169436 2025-12-04T09:43:32.9842889Z * [new tag] ciflow/inductor/169437 -> ciflow/inductor/169437 2025-12-04T09:43:32.9844228Z * [new tag] ciflow/inductor/169438 -> ciflow/inductor/169438 2025-12-04T09:43:32.9845526Z * [new tag] ciflow/inductor/169441 -> ciflow/inductor/169441 2025-12-04T09:43:32.9846608Z * [new tag] ciflow/inductor/169446 -> ciflow/inductor/169446 2025-12-04T09:43:32.9848059Z * [new tag] ciflow/inductor/169447 -> ciflow/inductor/169447 2025-12-04T09:43:32.9849330Z * [new tag] ciflow/inductor/169452 -> ciflow/inductor/169452 2025-12-04T09:43:32.9850717Z * [new tag] ciflow/inductor/169455 -> ciflow/inductor/169455 2025-12-04T09:43:32.9851947Z * [new tag] 
ciflow/inductor/169459 -> ciflow/inductor/169459 2025-12-04T09:43:32.9853284Z * [new tag] ciflow/inductor/169463 -> ciflow/inductor/169463 2025-12-04T09:43:32.9854717Z * [new tag] ciflow/inductor/169476 -> ciflow/inductor/169476 2025-12-04T09:43:32.9856270Z * [new tag] ciflow/inductor/169485 -> ciflow/inductor/169485 2025-12-04T09:43:32.9857426Z * [new tag] ciflow/inductor/169493 -> ciflow/inductor/169493 2025-12-04T09:43:32.9858699Z * [new tag] ciflow/inductor/169496 -> ciflow/inductor/169496 2025-12-04T09:43:32.9859965Z * [new tag] ciflow/inductor/169497 -> ciflow/inductor/169497 2025-12-04T09:43:32.9861276Z * [new tag] ciflow/inductor/169503 -> ciflow/inductor/169503 2025-12-04T09:43:32.9862459Z * [new tag] ciflow/inductor/169504 -> ciflow/inductor/169504 2025-12-04T09:43:32.9863894Z * [new tag] ciflow/inductor/169505 -> ciflow/inductor/169505 2025-12-04T09:43:32.9865585Z * [new tag] ciflow/inductor/169508 -> ciflow/inductor/169508 2025-12-04T09:43:32.9866861Z * [new tag] ciflow/inductor/169509 -> ciflow/inductor/169509 2025-12-04T09:43:32.9868266Z * [new tag] ciflow/inductor/169513 -> ciflow/inductor/169513 2025-12-04T09:43:32.9869501Z * [new tag] ciflow/inductor/169514 -> ciflow/inductor/169514 2025-12-04T09:43:32.9870760Z * [new tag] ciflow/inductor/169515 -> ciflow/inductor/169515 2025-12-04T09:43:32.9871998Z * [new tag] ciflow/inductor/169517 -> ciflow/inductor/169517 2025-12-04T09:43:32.9873276Z * [new tag] ciflow/inductor/169519 -> ciflow/inductor/169519 2025-12-04T09:43:32.9874536Z * [new tag] ciflow/inductor/169520 -> ciflow/inductor/169520 2025-12-04T09:43:32.9875814Z * [new tag] ciflow/inductor/169521 -> ciflow/inductor/169521 2025-12-04T09:43:32.9877076Z * [new tag] ciflow/inductor/169524 -> ciflow/inductor/169524 2025-12-04T09:43:32.9878312Z * [new tag] ciflow/inductor/169527 -> ciflow/inductor/169527 2025-12-04T09:43:32.9879546Z * [new tag] ciflow/inductor/169528 -> ciflow/inductor/169528 2025-12-04T09:43:32.9880888Z * [new tag] 
ciflow/inductor/169532 -> ciflow/inductor/169532 2025-12-04T09:43:32.9882169Z * [new tag] ciflow/inductor/169535 -> ciflow/inductor/169535 2025-12-04T09:43:32.9883412Z * [new tag] ciflow/inductor/169536 -> ciflow/inductor/169536 2025-12-04T09:43:32.9884902Z * [new tag] ciflow/inductor/169547 -> ciflow/inductor/169547 2025-12-04T09:43:32.9885867Z * [new tag] ciflow/inductor/169548 -> ciflow/inductor/169548 2025-12-04T09:43:32.9887176Z * [new tag] ciflow/inductor/169549 -> ciflow/inductor/169549 2025-12-04T09:43:32.9888351Z * [new tag] ciflow/inductor/169551 -> ciflow/inductor/169551 2025-12-04T09:43:32.9889546Z * [new tag] ciflow/inductor/169552 -> ciflow/inductor/169552 2025-12-04T09:43:32.9890928Z * [new tag] ciflow/inductor/169553 -> ciflow/inductor/169553 2025-12-04T09:43:32.9892161Z * [new tag] ciflow/inductor/169557 -> ciflow/inductor/169557 2025-12-04T09:43:32.9893608Z * [new tag] ciflow/inductor/3b9a386 -> ciflow/inductor/3b9a386 2025-12-04T09:43:32.9895052Z * [new tag] ciflow/inductor/3d4b92b -> ciflow/inductor/3d4b92b 2025-12-04T09:43:32.9896460Z * [new tag] ciflow/inductor/d224ac7 -> ciflow/inductor/d224ac7 2025-12-04T09:43:32.9897908Z * [new tag] ciflow/linux-aarch64/157994 -> ciflow/linux-aarch64/157994 2025-12-04T09:43:32.9899010Z * [new tag] ciflow/linux-aarch64/166075 -> ciflow/linux-aarch64/166075 2025-12-04T09:43:32.9900278Z * [new tag] ciflow/linux-aarch64/166876 -> ciflow/linux-aarch64/166876 2025-12-04T09:43:32.9901549Z * [new tag] ciflow/linux-aarch64/167981 -> ciflow/linux-aarch64/167981 2025-12-04T09:43:32.9903360Z * [new tag] ciflow/mps/166254 -> ciflow/mps/166254 2025-12-04T09:43:32.9904567Z * [new tag] ciflow/mps/169017 -> ciflow/mps/169017 2025-12-04T09:43:32.9905804Z * [new tag] ciflow/mps/169372 -> ciflow/mps/169372 2025-12-04T09:43:32.9907047Z * [new tag] ciflow/mps/169478 -> ciflow/mps/169478 2025-12-04T09:43:32.9908567Z * [new tag] ciflow/op-benchmark/157994 -> ciflow/op-benchmark/157994 2025-12-04T09:43:32.9909634Z * [new tag] 
ciflow/op-benchmark/166075 -> ciflow/op-benchmark/166075 2025-12-04T09:43:32.9910920Z * [new tag] ciflow/op-benchmark/169544 -> ciflow/op-benchmark/169544 2025-12-04T09:43:32.9912394Z * [new tag] ciflow/periodic-rocm-mi200/165997 -> ciflow/periodic-rocm-mi200/165997 2025-12-04T09:43:32.9913569Z * [new tag] ciflow/periodic-rocm-mi200/166517 -> ciflow/periodic-rocm-mi200/166517 2025-12-04T09:43:32.9914693Z * [new tag] ciflow/periodic-rocm-mi200/169063 -> ciflow/periodic-rocm-mi200/169063 2025-12-04T09:43:32.9915972Z * [new tag] ciflow/periodic-rocm-mi200/169425 -> ciflow/periodic-rocm-mi200/169425 2025-12-04T09:43:32.9917371Z * [new tag] ciflow/periodic-rocm-mi300/166517 -> ciflow/periodic-rocm-mi300/166517 2025-12-04T09:43:32.9918447Z * [new tag] ciflow/periodic-rocm-mi300/169063 -> ciflow/periodic-rocm-mi300/169063 2025-12-04T09:43:32.9919561Z * [new tag] ciflow/periodic-rocm-mi300/169425 -> ciflow/periodic-rocm-mi300/169425 2025-12-04T09:43:32.9921301Z * [new tag] ciflow/periodic/054a2fd -> ciflow/periodic/054a2fd 2025-12-04T09:43:32.9922385Z * [new tag] ciflow/periodic/167207 -> ciflow/periodic/167207 2025-12-04T09:43:32.9923715Z * [new tag] ciflow/periodic/167978 -> ciflow/periodic/167978 2025-12-04T09:43:32.9924836Z * [new tag] ciflow/periodic/168096 -> ciflow/periodic/168096 2025-12-04T09:43:32.9925969Z * [new tag] ciflow/periodic/169286 -> ciflow/periodic/169286 2025-12-04T09:43:32.9927365Z * [new tag] ciflow/periodic/2a6d37d -> ciflow/periodic/2a6d37d 2025-12-04T09:43:32.9928688Z * [new tag] ciflow/periodic/317eeb8 -> ciflow/periodic/317eeb8 2025-12-04T09:43:32.9930302Z * [new tag] ciflow/periodic/3c32 -> ciflow/periodic/3c32 2025-12-04T09:43:32.9931792Z * [new tag] ciflow/periodic/3e98831 -> ciflow/periodic/3e98831 2025-12-04T09:43:32.9933624Z * [new tag] ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> ciflow/periodic/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T09:43:32.9935047Z * [new tag] ciflow/periodic/94512-point -> 
ciflow/periodic/94512-point 2025-12-04T09:43:32.9936849Z * [new tag] ciflow/periodic/csl/test87519 -> ciflow/periodic/csl/test87519 2025-12-04T09:43:32.9938222Z * [new tag] ciflow/periodic/csltest88275 -> ciflow/periodic/csltest88275 2025-12-04T09:43:32.9939571Z * [new tag] ciflow/periodic/csltest88761 -> ciflow/periodic/csltest88761 2025-12-04T09:43:32.9941001Z * [new tag] ciflow/periodic/release_1.12 -> ciflow/periodic/release_1.12 2025-12-04T09:43:32.9942455Z * [new tag] ciflow/periodic/release_1.12.0 -> ciflow/periodic/release_1.12.0 2025-12-04T09:43:32.9943992Z * [new tag] ciflow/periodic/sha-ec5b83 -> ciflow/periodic/sha-ec5b83 2025-12-04T09:43:32.9945391Z * [new tag] ciflow/pull/167207 -> ciflow/pull/167207 2025-12-04T09:43:32.9947055Z * [new tag] ciflow/quantization-periodic/169207 -> ciflow/quantization-periodic/169207 2025-12-04T09:43:32.9948525Z * [new tag] ciflow/rocm-mi200/165545 -> ciflow/rocm-mi200/165545 2025-12-04T09:43:32.9949607Z * [new tag] ciflow/rocm-mi200/165997 -> ciflow/rocm-mi200/165997 2025-12-04T09:43:32.9950894Z * [new tag] ciflow/rocm-mi200/168096 -> ciflow/rocm-mi200/168096 2025-12-04T09:43:32.9952178Z * [new tag] ciflow/rocm-mi200/168275 -> ciflow/rocm-mi200/168275 2025-12-04T09:43:32.9953268Z * [new tag] ciflow/rocm-mi200/169063 -> ciflow/rocm-mi200/169063 2025-12-04T09:43:32.9954631Z * [new tag] ciflow/rocm-mi200/169356 -> ciflow/rocm-mi200/169356 2025-12-04T09:43:32.9955729Z * [new tag] ciflow/rocm-mi200/169425 -> ciflow/rocm-mi200/169425 2025-12-04T09:43:32.9957375Z * [new tag] ciflow/rocm-mi300/165545 -> ciflow/rocm-mi300/165545 2025-12-04T09:43:32.9958703Z * [new tag] ciflow/rocm-mi300/167157 -> ciflow/rocm-mi300/167157 2025-12-04T09:43:32.9959939Z * [new tag] ciflow/rocm-mi300/168096 -> ciflow/rocm-mi300/168096 2025-12-04T09:43:32.9960957Z * [new tag] ciflow/rocm-mi300/169063 -> ciflow/rocm-mi300/169063 2025-12-04T09:43:32.9962208Z * [new tag] ciflow/rocm-mi300/169425 -> ciflow/rocm-mi300/169425 2025-12-04T09:43:32.9963997Z * 
[new tag] ciflow/rocm-mi355/167157 -> ciflow/rocm-mi355/167157 2025-12-04T09:43:32.9965300Z * [new tag] ciflow/rocm-mi355/168275 -> ciflow/rocm-mi355/168275 2025-12-04T09:43:32.9966350Z * [new tag] ciflow/rocm-mi355/169425 -> ciflow/rocm-mi355/169425 2025-12-04T09:43:32.9967817Z * [new tag] ciflow/rocm-navi31/168275 -> ciflow/rocm-navi31/168275 2025-12-04T09:43:32.9968894Z * [new tag] ciflow/rocm-navi31/169425 -> ciflow/rocm-navi31/169425 2025-12-04T09:43:32.9970431Z * [new tag] ciflow/rocm/115316 -> ciflow/rocm/115316 2025-12-04T09:43:32.9971639Z * [new tag] ciflow/rocm/148492 -> ciflow/rocm/148492 2025-12-04T09:43:32.9972664Z * [new tag] ciflow/rocm/160685 -> ciflow/rocm/160685 2025-12-04T09:43:32.9973900Z * [new tag] ciflow/rocm/161607 -> ciflow/rocm/161607 2025-12-04T09:43:32.9975105Z * [new tag] ciflow/rocm/162052 -> ciflow/rocm/162052 2025-12-04T09:43:32.9976130Z * [new tag] ciflow/rocm/165997 -> ciflow/rocm/165997 2025-12-04T09:43:32.9977510Z * [new tag] ciflow/rocm/166165 -> ciflow/rocm/166165 2025-12-04T09:43:32.9978480Z * [new tag] ciflow/rocm/166517 -> ciflow/rocm/166517 2025-12-04T09:43:32.9980252Z * [new tag] ciflow/rocm/167207 -> ciflow/rocm/167207 2025-12-04T09:43:32.9981411Z * [new tag] ciflow/rocm/167536 -> ciflow/rocm/167536 2025-12-04T09:43:32.9982546Z * [new tag] ciflow/rocm/167781 -> ciflow/rocm/167781 2025-12-04T09:43:32.9984021Z * [new tag] ciflow/rocm/167989 -> ciflow/rocm/167989 2025-12-04T09:43:32.9985569Z * [new tag] ciflow/rocm/168073 -> ciflow/rocm/168073 2025-12-04T09:43:32.9986969Z * [new tag] ciflow/rocm/168195 -> ciflow/rocm/168195 2025-12-04T09:43:32.9988344Z * [new tag] ciflow/rocm/168939 -> ciflow/rocm/168939 2025-12-04T09:43:32.9989576Z * [new tag] ciflow/rocm/168971 -> ciflow/rocm/168971 2025-12-04T09:43:32.9990823Z * [new tag] ciflow/rocm/169024 -> ciflow/rocm/169024 2025-12-04T09:43:32.9991967Z * [new tag] ciflow/rocm/169200 -> ciflow/rocm/169200 2025-12-04T09:43:32.9993212Z * [new tag] ciflow/rocm/169216 -> 
ciflow/rocm/169216 2025-12-04T09:43:32.9994443Z * [new tag] ciflow/rocm/169312 -> ciflow/rocm/169312 2025-12-04T09:43:32.9995706Z * [new tag] ciflow/rocm/169380 -> ciflow/rocm/169380 2025-12-04T09:43:32.9996922Z * [new tag] ciflow/rocm/169427 -> ciflow/rocm/169427 2025-12-04T09:43:32.9998144Z * [new tag] ciflow/rocm/169455 -> ciflow/rocm/169455 2025-12-04T09:43:32.9999428Z * [new tag] ciflow/rocm/169470 -> ciflow/rocm/169470 2025-12-04T09:43:33.0000659Z * [new tag] ciflow/rocm/169471 -> ciflow/rocm/169471 2025-12-04T09:43:33.0001893Z * [new tag] ciflow/rocm/169472 -> ciflow/rocm/169472 2025-12-04T09:43:33.0003098Z * [new tag] ciflow/rocm/169514 -> ciflow/rocm/169514 2025-12-04T09:43:33.0004666Z * [new tag] ciflow/slow/01c7106 -> ciflow/slow/01c7106 2025-12-04T09:43:33.0005980Z * [new tag] ciflow/slow/0577043 -> ciflow/slow/0577043 2025-12-04T09:43:33.0007606Z * [new tag] ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym -> ciflow/slow/0d5b74da0cab798fbfdb9caa53fad816999c8386-sdym 2025-12-04T09:43:33.0008678Z * [new tag] ciflow/slow/0e81104 -> ciflow/slow/0e81104 2025-12-04T09:43:33.0010031Z * [new tag] ciflow/slow/167207 -> ciflow/slow/167207 2025-12-04T09:43:33.0011243Z * [new tag] ciflow/slow/168050 -> ciflow/slow/168050 2025-12-04T09:43:33.0012519Z * [new tag] ciflow/slow/1732077 -> ciflow/slow/1732077 2025-12-04T09:43:33.0013864Z * [new tag] ciflow/slow/187eb7c -> ciflow/slow/187eb7c 2025-12-04T09:43:33.0015523Z * [new tag] ciflow/slow/1faef89 -> ciflow/slow/1faef89 2025-12-04T09:43:33.0017203Z * [new tag] ciflow/slow/3920ec1 -> ciflow/slow/3920ec1 2025-12-04T09:43:33.0018758Z * [new tag] ciflow/slow/3b7c6b2 -> ciflow/slow/3b7c6b2 2025-12-04T09:43:33.0020173Z * [new tag] ciflow/slow/59a3759 -> ciflow/slow/59a3759 2025-12-04T09:43:33.0021539Z * [new tag] ciflow/slow/70ef0bb -> ciflow/slow/70ef0bb 2025-12-04T09:43:33.0022968Z * [new tag] ciflow/slow/788ff06 -> ciflow/slow/788ff06 2025-12-04T09:43:33.0024682Z * [new tag] 
ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym -> ciflow/slow/8751002215790a3a88750faa8f4366933e296693-sdym 2025-12-04T09:43:33.0025689Z * [new tag] ciflow/slow/9d85864 -> ciflow/slow/9d85864 2025-12-04T09:43:33.0027359Z * [new tag] ciflow/slow/9ffad5b -> ciflow/slow/9ffad5b 2025-12-04T09:43:33.0028722Z * [new tag] ciflow/slow/a206e8b -> ciflow/slow/a206e8b 2025-12-04T09:43:33.0030061Z * [new tag] ciflow/slow/a837609 -> ciflow/slow/a837609 2025-12-04T09:43:33.0031471Z * [new tag] ciflow/slow/af841f3 -> ciflow/slow/af841f3 2025-12-04T09:43:33.0033233Z * [new tag] ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym -> ciflow/slow/da3aba1e46157c4df504b067477cdf2b3c96b194-sdym 2025-12-04T09:43:33.0034391Z * [new tag] ciflow/torchbench/168175 -> ciflow/torchbench/168175 2025-12-04T09:43:33.0035880Z * [new tag] ciflow/trunk/148492 -> ciflow/trunk/148492 2025-12-04T09:43:33.0037112Z * [new tag] ciflow/trunk/157149 -> ciflow/trunk/157149 2025-12-04T09:43:33.0038691Z * [new tag] ciflow/trunk/157994 -> ciflow/trunk/157994 2025-12-04T09:43:33.0039925Z * [new tag] ciflow/trunk/159718 -> ciflow/trunk/159718 2025-12-04T09:43:33.0040969Z * [new tag] ciflow/trunk/160685 -> ciflow/trunk/160685 2025-12-04T09:43:33.0042217Z * [new tag] ciflow/trunk/160729 -> ciflow/trunk/160729 2025-12-04T09:43:33.0043273Z * [new tag] ciflow/trunk/162275 -> ciflow/trunk/162275 2025-12-04T09:43:33.0044542Z * [new tag] ciflow/trunk/162795 -> ciflow/trunk/162795 2025-12-04T09:43:33.0045587Z * [new tag] ciflow/trunk/163245 -> ciflow/trunk/163245 2025-12-04T09:43:33.0046839Z * [new tag] ciflow/trunk/163942 -> ciflow/trunk/163942 2025-12-04T09:43:33.0048047Z * [new tag] ciflow/trunk/165274 -> ciflow/trunk/165274 2025-12-04T09:43:33.0049575Z * [new tag] ciflow/trunk/165483 -> ciflow/trunk/165483 2025-12-04T09:43:33.0051071Z * [new tag] ciflow/trunk/165728 -> ciflow/trunk/165728 2025-12-04T09:43:33.0052474Z * [new tag] ciflow/trunk/165922 -> ciflow/trunk/165922 2025-12-04T09:43:33.0053732Z * 
[new tag] ciflow/trunk/166075 -> ciflow/trunk/166075 2025-12-04T09:43:33.0054977Z * [new tag] ciflow/trunk/166165 -> ciflow/trunk/166165 2025-12-04T09:43:33.0056500Z * [new tag] ciflow/trunk/166829 -> ciflow/trunk/166829 2025-12-04T09:43:33.0057821Z * [new tag] ciflow/trunk/166843 -> ciflow/trunk/166843 2025-12-04T09:43:33.0059064Z * [new tag] ciflow/trunk/166876 -> ciflow/trunk/166876 2025-12-04T09:43:33.0060353Z * [new tag] ciflow/trunk/167207 -> ciflow/trunk/167207 2025-12-04T09:43:33.0061588Z * [new tag] ciflow/trunk/167536 -> ciflow/trunk/167536 2025-12-04T09:43:33.0062882Z * [new tag] ciflow/trunk/167552 -> ciflow/trunk/167552 2025-12-04T09:43:33.0064138Z * [new tag] ciflow/trunk/167555 -> ciflow/trunk/167555 2025-12-04T09:43:33.0065399Z * [new tag] ciflow/trunk/167599 -> ciflow/trunk/167599 2025-12-04T09:43:33.0066621Z * [new tag] ciflow/trunk/167659 -> ciflow/trunk/167659 2025-12-04T09:43:33.0068105Z * [new tag] ciflow/trunk/167672 -> ciflow/trunk/167672 2025-12-04T09:43:33.0069378Z * [new tag] ciflow/trunk/167742 -> ciflow/trunk/167742 2025-12-04T09:43:33.0070719Z * [new tag] ciflow/trunk/167781 -> ciflow/trunk/167781 2025-12-04T09:43:33.0072040Z * [new tag] ciflow/trunk/167837 -> ciflow/trunk/167837 2025-12-04T09:43:33.0073316Z * [new tag] ciflow/trunk/167887 -> ciflow/trunk/167887 2025-12-04T09:43:33.0074572Z * [new tag] ciflow/trunk/167978 -> ciflow/trunk/167978 2025-12-04T09:43:33.0075936Z * [new tag] ciflow/trunk/168050 -> ciflow/trunk/168050 2025-12-04T09:43:33.0076895Z * [new tag] ciflow/trunk/168051 -> ciflow/trunk/168051 2025-12-04T09:43:33.0078225Z * [new tag] ciflow/trunk/168096 -> ciflow/trunk/168096 2025-12-04T09:43:33.0079481Z * [new tag] ciflow/trunk/168127 -> ciflow/trunk/168127 2025-12-04T09:43:33.0080677Z * [new tag] ciflow/trunk/168157 -> ciflow/trunk/168157 2025-12-04T09:43:33.0081785Z * [new tag] ciflow/trunk/168175 -> ciflow/trunk/168175 2025-12-04T09:43:33.0083085Z * [new tag] ciflow/trunk/168209 -> ciflow/trunk/168209 
2025-12-04T09:43:33.0084509Z * [new tag] ciflow/trunk/168213 -> ciflow/trunk/168213 2025-12-04T09:43:33.0085896Z * [new tag] ciflow/trunk/168226 -> ciflow/trunk/168226 2025-12-04T09:43:33.0087191Z * [new tag] ciflow/trunk/168262 -> ciflow/trunk/168262 2025-12-04T09:43:33.0088425Z * [new tag] ciflow/trunk/168275 -> ciflow/trunk/168275 2025-12-04T09:43:33.0089810Z * [new tag] ciflow/trunk/168328 -> ciflow/trunk/168328 2025-12-04T09:43:33.0091024Z * [new tag] ciflow/trunk/168368 -> ciflow/trunk/168368 2025-12-04T09:43:33.0092258Z * [new tag] ciflow/trunk/168917 -> ciflow/trunk/168917 2025-12-04T09:43:33.0093484Z * [new tag] ciflow/trunk/168933 -> ciflow/trunk/168933 2025-12-04T09:43:33.0094826Z * [new tag] ciflow/trunk/168941 -> ciflow/trunk/168941 2025-12-04T09:43:33.0096043Z * [new tag] ciflow/trunk/168955 -> ciflow/trunk/168955 2025-12-04T09:43:33.0097278Z * [new tag] ciflow/trunk/168980 -> ciflow/trunk/168980 2025-12-04T09:43:33.0098649Z * [new tag] ciflow/trunk/169004 -> ciflow/trunk/169004 2025-12-04T09:43:33.0099974Z * [new tag] ciflow/trunk/169006 -> ciflow/trunk/169006 2025-12-04T09:43:33.0101200Z * [new tag] ciflow/trunk/169023 -> ciflow/trunk/169023 2025-12-04T09:43:33.0102440Z * [new tag] ciflow/trunk/169025 -> ciflow/trunk/169025 2025-12-04T09:43:33.0103703Z * [new tag] ciflow/trunk/169048 -> ciflow/trunk/169048 2025-12-04T09:43:33.0104984Z * [new tag] ciflow/trunk/169066 -> ciflow/trunk/169066 2025-12-04T09:43:33.0106183Z * [new tag] ciflow/trunk/169091 -> ciflow/trunk/169091 2025-12-04T09:43:33.0107444Z * [new tag] ciflow/trunk/169102 -> ciflow/trunk/169102 2025-12-04T09:43:33.0108711Z * [new tag] ciflow/trunk/169103 -> ciflow/trunk/169103 2025-12-04T09:43:33.0110058Z * [new tag] ciflow/trunk/169125 -> ciflow/trunk/169125 2025-12-04T09:43:33.0111369Z * [new tag] ciflow/trunk/169139 -> ciflow/trunk/169139 2025-12-04T09:43:33.0112676Z * [new tag] ciflow/trunk/169148 -> ciflow/trunk/169148 2025-12-04T09:43:33.0113978Z * [new tag] ciflow/trunk/169151 -> 
ciflow/trunk/169151 2025-12-04T09:43:33.0115307Z * [new tag] ciflow/trunk/169156 -> ciflow/trunk/169156 2025-12-04T09:43:33.0116599Z * [new tag] ciflow/trunk/169176 -> ciflow/trunk/169176 2025-12-04T09:43:33.0117849Z * [new tag] ciflow/trunk/169204 -> ciflow/trunk/169204 2025-12-04T09:43:33.0119063Z * [new tag] ciflow/trunk/169207 -> ciflow/trunk/169207 2025-12-04T09:43:33.0120759Z * [new tag] ciflow/trunk/169211 -> ciflow/trunk/169211 2025-12-04T09:43:33.0122146Z * [new tag] ciflow/trunk/169231 -> ciflow/trunk/169231 2025-12-04T09:43:33.0123516Z * [new tag] ciflow/trunk/169260 -> ciflow/trunk/169260 2025-12-04T09:43:33.0124924Z * [new tag] ciflow/trunk/169271 -> ciflow/trunk/169271 2025-12-04T09:43:33.0126185Z * [new tag] ciflow/trunk/169280 -> ciflow/trunk/169280 2025-12-04T09:43:33.0127418Z * [new tag] ciflow/trunk/169281 -> ciflow/trunk/169281 2025-12-04T09:43:33.0128632Z * [new tag] ciflow/trunk/169286 -> ciflow/trunk/169286 2025-12-04T09:43:33.0130006Z * [new tag] ciflow/trunk/169293 -> ciflow/trunk/169293 2025-12-04T09:43:33.0131270Z * [new tag] ciflow/trunk/169296 -> ciflow/trunk/169296 2025-12-04T09:43:33.0132481Z * [new tag] ciflow/trunk/169304 -> ciflow/trunk/169304 2025-12-04T09:43:33.0133724Z * [new tag] ciflow/trunk/169305 -> ciflow/trunk/169305 2025-12-04T09:43:33.0134942Z * [new tag] ciflow/trunk/169312 -> ciflow/trunk/169312 2025-12-04T09:43:33.0136365Z * [new tag] ciflow/trunk/169328 -> ciflow/trunk/169328 2025-12-04T09:43:33.0137618Z * [new tag] ciflow/trunk/169343 -> ciflow/trunk/169343 2025-12-04T09:43:33.0138848Z * [new tag] ciflow/trunk/169355 -> ciflow/trunk/169355 2025-12-04T09:43:33.0140081Z * [new tag] ciflow/trunk/169370 -> ciflow/trunk/169370 2025-12-04T09:43:33.0141429Z * [new tag] ciflow/trunk/169379 -> ciflow/trunk/169379 2025-12-04T09:43:33.0142713Z * [new tag] ciflow/trunk/169380 -> ciflow/trunk/169380 2025-12-04T09:43:33.0143971Z * [new tag] ciflow/trunk/169385 -> ciflow/trunk/169385 2025-12-04T09:43:33.0145278Z * [new tag] 
ciflow/trunk/169387 -> ciflow/trunk/169387 2025-12-04T09:43:33.0146650Z * [new tag] ciflow/trunk/169410 -> ciflow/trunk/169410 2025-12-04T09:43:33.0148004Z * [new tag] ciflow/trunk/169412 -> ciflow/trunk/169412 2025-12-04T09:43:33.0149240Z * [new tag] ciflow/trunk/169418 -> ciflow/trunk/169418 2025-12-04T09:43:33.0150445Z * [new tag] ciflow/trunk/169423 -> ciflow/trunk/169423 2025-12-04T09:43:33.0151688Z * [new tag] ciflow/trunk/169427 -> ciflow/trunk/169427 2025-12-04T09:43:33.0152916Z * [new tag] ciflow/trunk/169430 -> ciflow/trunk/169430 2025-12-04T09:43:33.0154137Z * [new tag] ciflow/trunk/169437 -> ciflow/trunk/169437 2025-12-04T09:43:33.0155574Z * [new tag] ciflow/trunk/169442 -> ciflow/trunk/169442 2025-12-04T09:43:33.0158955Z * [new tag] ciflow/trunk/169452 -> ciflow/trunk/169452 2025-12-04T09:43:33.0160213Z * [new tag] ciflow/trunk/169454 -> ciflow/trunk/169454 2025-12-04T09:43:33.0161430Z * [new tag] ciflow/trunk/169459 -> ciflow/trunk/169459 2025-12-04T09:43:33.0162849Z * [new tag] ciflow/trunk/169474 -> ciflow/trunk/169474 2025-12-04T09:43:33.0164119Z * [new tag] ciflow/trunk/169475 -> ciflow/trunk/169475 2025-12-04T09:43:33.0165407Z * [new tag] ciflow/trunk/169476 -> ciflow/trunk/169476 2025-12-04T09:43:33.0166787Z * [new tag] ciflow/trunk/169487 -> ciflow/trunk/169487 2025-12-04T09:43:33.0168039Z * [new tag] ciflow/trunk/169497 -> ciflow/trunk/169497 2025-12-04T09:43:33.0169293Z * [new tag] ciflow/trunk/169503 -> ciflow/trunk/169503 2025-12-04T09:43:33.0170541Z * [new tag] ciflow/trunk/169505 -> ciflow/trunk/169505 2025-12-04T09:43:33.0171777Z * [new tag] ciflow/trunk/169507 -> ciflow/trunk/169507 2025-12-04T09:43:33.0172997Z * [new tag] ciflow/trunk/169514 -> ciflow/trunk/169514 2025-12-04T09:43:33.0174354Z * [new tag] ciflow/trunk/169517 -> ciflow/trunk/169517 2025-12-04T09:43:33.0175493Z * [new tag] ciflow/trunk/169519 -> ciflow/trunk/169519 2025-12-04T09:43:33.0176685Z * [new tag] ciflow/trunk/169528 -> ciflow/trunk/169528 
2025-12-04T09:43:33.0177896Z * [new tag] ciflow/trunk/169541 -> ciflow/trunk/169541 2025-12-04T09:43:33.0179273Z * [new tag] ciflow/trunk/169555 -> ciflow/trunk/169555 2025-12-04T09:43:33.0180942Z * [new tag] ciflow/unstable/123 -> ciflow/unstable/123 2025-12-04T09:43:33.0182367Z * [new tag] ciflow/vllm/165270 -> ciflow/vllm/165270 2025-12-04T09:43:33.0183580Z * [new tag] ciflow/vllm/165274 -> ciflow/vllm/165274 2025-12-04T09:43:33.0185169Z * [new tag] ciflow/vllm/166494 -> ciflow/vllm/166494 2025-12-04T09:43:33.0186454Z * [new tag] ciflow/vllm/169219 -> ciflow/vllm/169219 2025-12-04T09:43:33.0187665Z * [new tag] ciflow/vllm/169220 -> ciflow/vllm/169220 2025-12-04T09:43:33.0189498Z * [new tag] ciflow/xpu/157994 -> ciflow/xpu/157994 2025-12-04T09:43:33.0190739Z * [new tag] ciflow/xpu/159718 -> ciflow/xpu/159718 2025-12-04T09:43:33.0192057Z * [new tag] ciflow/xpu/161940 -> ciflow/xpu/161940 2025-12-04T09:43:33.0193159Z * [new tag] ciflow/xpu/163251 -> ciflow/xpu/163251 2025-12-04T09:43:33.0194476Z * [new tag] ciflow/xpu/166829 -> ciflow/xpu/166829 2025-12-04T09:43:33.0195437Z * [new tag] ciflow/xpu/166843 -> ciflow/xpu/166843 2025-12-04T09:43:33.0196728Z * [new tag] ciflow/xpu/167972 -> ciflow/xpu/167972 2025-12-04T09:43:33.0197745Z * [new tag] ciflow/xpu/167981 -> ciflow/xpu/167981 2025-12-04T09:43:33.0199094Z * [new tag] ciflow/xpu/168213 -> ciflow/xpu/168213 2025-12-04T09:43:33.0200041Z * [new tag] ciflow/xpu/168262 -> ciflow/xpu/168262 2025-12-04T09:43:33.0201374Z * [new tag] ciflow/xpu/168328 -> ciflow/xpu/168328 2025-12-04T09:43:33.0202828Z * [new tag] ciflow/xpu/168950 -> ciflow/xpu/168950 2025-12-04T09:43:33.0204484Z * [new tag] ciflow/xpu/169039 -> ciflow/xpu/169039 2025-12-04T09:43:33.0205879Z * [new tag] ciflow/xpu/169200 -> ciflow/xpu/169200 2025-12-04T09:43:33.0207179Z * [new tag] ciflow/xpu/169203 -> ciflow/xpu/169203 2025-12-04T09:43:33.0208203Z * [new tag] ciflow/xpu/169230 -> ciflow/xpu/169230 2025-12-04T09:43:33.0209612Z * [new tag] 
ciflow/xpu/169231 -> ciflow/xpu/169231 2025-12-04T09:43:33.0210924Z * [new tag] ciflow/xpu/169241 -> ciflow/xpu/169241 2025-12-04T09:43:33.0212095Z * [new tag] ciflow/xpu/169280 -> ciflow/xpu/169280 2025-12-04T09:43:33.0213369Z * [new tag] ciflow/xpu/169296 -> ciflow/xpu/169296 2025-12-04T09:43:33.0214787Z * [new tag] ciflow/xpu/169353 -> ciflow/xpu/169353 2025-12-04T09:43:33.0215801Z * [new tag] ciflow/xpu/169410 -> ciflow/xpu/169410 2025-12-04T09:43:33.0217177Z * [new tag] ciflow/xpu/169442 -> ciflow/xpu/169442 2025-12-04T09:43:33.0218463Z * [new tag] ciflow/xpu/169555 -> ciflow/xpu/169555 2025-12-04T09:43:33.0219779Z * [new tag] cslpull75 -> cslpull75 2025-12-04T09:43:33.0220786Z * [new tag] cslpull76 -> cslpull76 2025-12-04T09:43:33.0222147Z * [new tag] cslpull77 -> cslpull77 2025-12-04T09:43:33.0223444Z * [new tag] cslpull78 -> cslpull78 2025-12-04T09:43:33.0224959Z * [new tag] cslpull79 -> cslpull79 2025-12-04T09:43:33.0226561Z * [new tag] cslpull80 -> cslpull80 2025-12-04T09:43:33.0227989Z * [new tag] cslpull81 -> cslpull81 2025-12-04T09:43:33.0229296Z * [new tag] cslpull82 -> cslpull82 2025-12-04T09:43:33.0230695Z * [new tag] cslpull83 -> cslpull83 2025-12-04T09:43:33.0231830Z * [new tag] cslpull84 -> cslpull84 2025-12-04T09:43:33.0233196Z * [new tag] cslpull85 -> cslpull85 2025-12-04T09:43:33.0234494Z * [new tag] cslpull86 -> cslpull86 2025-12-04T09:43:33.0235969Z * [new tag] cslpull87 -> cslpull87 2025-12-04T09:43:33.0237229Z * [new tag] cslpull88 -> cslpull88 2025-12-04T09:43:33.0238502Z * [new tag] cslpull89 -> cslpull89 2025-12-04T09:43:33.0239500Z * [new tag] cslpull90 -> cslpull90 2025-12-04T09:43:33.0241250Z * [new tag] cslpull91 -> cslpull91 2025-12-04T09:43:33.0242458Z * [new tag] cslpull92 -> cslpull92 2025-12-04T09:43:33.0243860Z * [new tag] flight_5 -> flight_5 2025-12-04T09:43:33.0245338Z * [new tag] flight_5.1 -> flight_5.1 2025-12-04T09:43:33.0246754Z * [new tag] flight_5.2 -> flight_5.2 2025-12-04T09:43:33.0248059Z * [new tag] flight_5.3 -> 
flight_5.3 2025-12-04T09:43:33.0249373Z * [new tag] forpull1 -> forpull1 2025-12-04T09:43:33.0250918Z * [new tag] malfet/tag-2ef5611 -> malfet/tag-2ef5611 2025-12-04T09:43:33.0252302Z * [new tag] malfet/tag-317b1a0 -> malfet/tag-317b1a0 2025-12-04T09:43:33.0253583Z * [new tag] malfet/tag-ec6f767 -> malfet/tag-ec6f767 2025-12-04T09:43:33.0254957Z * [new tag] nightly-binary -> nightly-binary 2025-12-04T09:43:33.0256500Z * [new tag] sqzhang_flight4_plus -> sqzhang_flight4_plus 2025-12-04T09:43:33.0257926Z * [new tag] sqzhang_flight_3 -> sqzhang_flight_3 2025-12-04T09:43:33.0259544Z * [new tag] trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 -> trunk/02d8bd6974cf84b721680d773dbdb1b6f40ce272 2025-12-04T09:43:33.0260918Z * [new tag] trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e -> trunk/066997fb38ade71e00d78e9d572e380b5f02bd3e 2025-12-04T09:43:33.0262586Z * [new tag] trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 -> trunk/076e7b19fa1d481ad778d06d2b49ba57d3ce8c88 2025-12-04T09:43:33.0264110Z * [new tag] trunk/07dcc0b83db3211653a38565a24e15acdba75654 -> trunk/07dcc0b83db3211653a38565a24e15acdba75654 2025-12-04T09:43:33.0265439Z * [new tag] trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb -> trunk/082e96b68dfcd16cab7cfafc4d3d055767dab3eb 2025-12-04T09:43:33.0267313Z * [new tag] trunk/088048f2fea28ff7d450f65c72419ca45780d30b -> trunk/088048f2fea28ff7d450f65c72419ca45780d30b 2025-12-04T09:43:33.0268954Z * [new tag] trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 -> trunk/09076941a95c76f4d9ad189d064dfd8baa39e672 2025-12-04T09:43:33.0270114Z * [new tag] trunk/0b80a4c62b94402844bf221791c096b0035c6d75 -> trunk/0b80a4c62b94402844bf221791c096b0035c6d75 2025-12-04T09:43:33.0271786Z * [new tag] trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 -> trunk/0bbbdf1750567a980634ad907a325357ba8ba8f2 2025-12-04T09:43:33.0273264Z * [new tag] trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 -> trunk/0c281dd78773b2bc17c58ead0e4cd4ac46e775c5 2025-12-04T09:43:33.0274400Z * [new tag] 
trunk/135f3753c418a6879b1954904184937b67e61688 -> trunk/135f3753c418a6879b1954904184937b67e61688 2025-12-04T09:43:33.0276233Z * [new tag] trunk/15da21026cb13cd20257dc9e96830db108743c10 -> trunk/15da21026cb13cd20257dc9e96830db108743c10 2025-12-04T09:43:33.0278674Z * [new tag] trunk/166efdad2ac827f30fb02504c6017520257f88ec -> trunk/166efdad2ac827f30fb02504c6017520257f88ec 2025-12-04T09:43:33.0279172Z * [new tag] trunk/174272c15fae553d8488140af931f7d8050a313f -> trunk/174272c15fae553d8488140af931f7d8050a313f 2025-12-04T09:43:33.0280394Z * [new tag] trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 -> trunk/18f3ca08f13b8de61307f5e8cd7d4cccb67e9d11 2025-12-04T09:43:33.0281219Z * [new tag] trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 -> trunk/1902eddfe655a15ebcf2c72bd81ade110fdeef63 2025-12-04T09:43:33.0282761Z * [new tag] trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 -> trunk/195f92e98d3d66738577f11f22c4b5c8a1c76dd5 2025-12-04T09:43:33.0283918Z * [new tag] trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 -> trunk/1aa13e17de39e3c768ea7aebaad166ce72a06676 2025-12-04T09:43:33.0285397Z * [new tag] trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e -> trunk/1afe2832f58e24e54a5bfda5a5afa9b96fdea40e 2025-12-04T09:43:33.0286737Z * [new tag] trunk/1c87554d74140eaee964ca8b1832cede67f5f520 -> trunk/1c87554d74140eaee964ca8b1832cede67f5f520 2025-12-04T09:43:33.0288192Z * [new tag] trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 -> trunk/1ccb743b7b5be955f49736c162c4f5004b8a0dd8 2025-12-04T09:43:33.0289586Z * [new tag] trunk/1cee47d6ce0a02227185b566593f002dd639ca0c -> trunk/1cee47d6ce0a02227185b566593f002dd639ca0c 2025-12-04T09:43:33.0290691Z * [new tag] trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d -> trunk/1d21b4df2babe322e5d085ceb6de884eb260a62d 2025-12-04T09:43:33.0292217Z * [new tag] trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 -> trunk/1e34fb2550e4aa650314f7a6d9f6daf4da7478a8 2025-12-04T09:43:33.0293592Z * [new tag] trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de -> 
trunk/1e526fb5b1d93bfc70691c5c3955fdffc1b7b7de 2025-12-04T09:43:33.0294900Z * [new tag] trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 -> trunk/1ee32a8b1f554a312d79bad01ded24f38cd95543 2025-12-04T09:43:33.0296324Z * [new tag] trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 -> trunk/201e2c4117eb9744594dad6a5c18213d7b4705d7 2025-12-04T09:43:33.0297602Z * [new tag] trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f -> trunk/2353a0f60eb4b4cb6675907a7fa9fbedc1c02e7f 2025-12-04T09:43:33.0299029Z * [new tag] trunk/285779b1621cf9f073a062b0889a642d200308d9 -> trunk/285779b1621cf9f073a062b0889a642d200308d9 2025-12-04T09:43:33.0300285Z * [new tag] trunk/2887faaec6295d081580d09fce161201826c6d87 -> trunk/2887faaec6295d081580d09fce161201826c6d87 2025-12-04T09:43:33.0301649Z * [new tag] trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc -> trunk/296e67c92635443c67b11c0ae1bd045f03ebb7bc 2025-12-04T09:43:33.0303100Z * [new tag] trunk/29856679769b3dede478767e2fe6cfb51197cb25 -> trunk/29856679769b3dede478767e2fe6cfb51197cb25 2025-12-04T09:43:33.0304475Z * [new tag] trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 -> trunk/29e5455a4740c326ab187c7aa7b5ef98034ea563 2025-12-04T09:43:33.0305839Z * [new tag] trunk/2ac3ef882afb23136adc188975f0a8802fc68adf -> trunk/2ac3ef882afb23136adc188975f0a8802fc68adf 2025-12-04T09:43:33.0307081Z * [new tag] trunk/2bec68e73b64715354af076ad309335f943e36cd -> trunk/2bec68e73b64715354af076ad309335f943e36cd 2025-12-04T09:43:33.0308563Z * [new tag] trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 -> trunk/2c87367e6f88662cd5cedbd1537748b7948c38e1 2025-12-04T09:43:33.0310000Z * [new tag] trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 -> trunk/2d1f78fe3ec13820f136a2e0336da12a25f41708 2025-12-04T09:43:33.0311344Z * [new tag] trunk/2df6058f116a65722a0e03073402feb242572d35 -> trunk/2df6058f116a65722a0e03073402feb242572d35 2025-12-04T09:43:33.0312780Z * [new tag] trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec -> trunk/2e0c2e170fe658c440775c8e5c44228aafcc47ec 
2025-12-04T09:43:33.0314312Z * [new tag] trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 -> trunk/2f9b7dad7b5419b063bd0f2e204de192720ebb94 2025-12-04T09:43:33.0315656Z * [new tag] trunk/305168768a95d69c444df5cd334bb774edfe06f1 -> trunk/305168768a95d69c444df5cd334bb774edfe06f1 2025-12-04T09:43:33.0316989Z * [new tag] trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 -> trunk/31fc12773026e8e00f054dd79ad9b2491e693b48 2025-12-04T09:43:33.0318322Z * [new tag] trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 -> trunk/320de0c6b0a3e7c6d2693ea5c28d5d0156ba7991 2025-12-04T09:43:33.0319669Z * [new tag] trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 -> trunk/3418bd29475dff06695045fcdf93e7d0dac67da8 2025-12-04T09:43:33.0321020Z * [new tag] trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf -> trunk/34a98608afa0cb5b48f0d6d30432fdd0a2614ddf 2025-12-04T09:43:33.0322292Z * [new tag] trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee -> trunk/35b7a9a26c5923d98aebaa41a031dae21788a9ee 2025-12-04T09:43:33.0323716Z * [new tag] trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 -> trunk/39d07dbf03a911bdd45d1af78d8638dc92074938 2025-12-04T09:43:33.0324960Z * [new tag] trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 -> trunk/3cd98b4205ada151042cc7ff097a82d4a4b18725 2025-12-04T09:43:33.0326287Z * [new tag] trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae -> trunk/3d35fd20a78ff4d016fa80f4e5fad37191d7bcae 2025-12-04T09:43:33.0327632Z * [new tag] trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f -> trunk/409a5fee945c46a3edaf5df162812f201bfd7b2f 2025-12-04T09:43:33.0328992Z * [new tag] trunk/42e9005cda22da3f1c559c3649218cebd671027c -> trunk/42e9005cda22da3f1c559c3649218cebd671027c 2025-12-04T09:43:33.0330360Z * [new tag] trunk/43b94713bbf340d3c124fde02d0f73add4021247 -> trunk/43b94713bbf340d3c124fde02d0f73add4021247 2025-12-04T09:43:33.0331766Z * [new tag] trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c -> trunk/44ac69388a4a5eb463dbd2a13f00d1e3b924566c 2025-12-04T09:43:33.0333092Z * [new tag] 
trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a -> trunk/45d14e2497292be06ad36eaa1aaaf7c630a2586a 2025-12-04T09:43:33.0334402Z * [new tag] trunk/45d310ad84854dff730c0b12e577d7998d978686 -> trunk/45d310ad84854dff730c0b12e577d7998d978686 2025-12-04T09:43:33.0335941Z * [new tag] trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 -> trunk/47b28ddf7bd74b50fa93b307a7d3b183a6d77f54 2025-12-04T09:43:33.0337198Z * [new tag] trunk/481e5ab336275bd3acd5fa8a611b05b4469012af -> trunk/481e5ab336275bd3acd5fa8a611b05b4469012af 2025-12-04T09:43:33.0338535Z * [new tag] trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 -> trunk/491731647f6b8a9345dcfb3bc9416aea254a7d96 2025-12-04T09:43:33.0339857Z * [new tag] trunk/49a04d26088acc17d948ddd66920f3e16371e873 -> trunk/49a04d26088acc17d948ddd66920f3e16371e873 2025-12-04T09:43:33.0341165Z * [new tag] trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 -> trunk/4bebc827c47d2f1f0fa1a417a5201a97aef3d985 2025-12-04T09:43:33.0342570Z * [new tag] trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f -> trunk/4c246677784c6a14bc2dbb9ff8773ef0a3a3222f 2025-12-04T09:43:33.0344132Z * [new tag] trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa -> trunk/4cfb47ff548b6d996641058cf04a70e311a4c3aa 2025-12-04T09:43:33.0345595Z * [new tag] trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c -> trunk/4e0061c1aa52f606dda8cfab0bd7591e588faf2c 2025-12-04T09:43:33.0347353Z * [new tag] trunk/4fefb8e7e942386ffac764a41b232241f82bea3a -> trunk/4fefb8e7e942386ffac764a41b232241f82bea3a 2025-12-04T09:43:33.0348753Z * [new tag] trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d -> trunk/503b2640023521f5a35cd9a52fc8033d73a95d0d 2025-12-04T09:43:33.0350159Z * [new tag] trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 -> trunk/518c2b1b3dab9a2ef2849e04b3bc2f20c1c41db9 2025-12-04T09:43:33.0351488Z * [new tag] trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 -> trunk/5191b2fa68ba19960912bfd7fd721c79d76bb1f3 2025-12-04T09:43:33.0352938Z * [new tag] trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a -> 
trunk/52ac0f0dc4acacd219f1317fbc28ec631c01e07a 2025-12-04T09:43:33.0354295Z * [new tag] trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 -> trunk/539ba711b029de9f191070f4f0d12f18f5b7f292 2025-12-04T09:43:33.0356324Z * [new tag] trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 -> trunk/556375b55deebebbc56cb7aef81f4d52f031ba28 2025-12-04T09:43:33.0357909Z * [new tag] trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 -> trunk/55c4ab554845481d0a69a3811937575fe8bb1a66 2025-12-04T09:43:33.0359198Z * [new tag] trunk/5634469fda9e5d98869c82c7d03bb08914245f96 -> trunk/5634469fda9e5d98869c82c7d03bb08914245f96 2025-12-04T09:43:33.0360423Z * [new tag] trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc -> trunk/5778f6ff894686a975a9a23645178ae4c87ad5dc 2025-12-04T09:43:33.0361800Z * [new tag] trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 -> trunk/587d63a3e07de5dc91065f9ef70bcacda9989068 2025-12-04T09:43:33.0363162Z * [new tag] trunk/597930f6b568852356ca9795dac76f9e4653adbd -> trunk/597930f6b568852356ca9795dac76f9e4653adbd 2025-12-04T09:43:33.0364542Z * [new tag] trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 -> trunk/597df3a4e2a67b9fdbe1a89b2f4d74f822274db6 2025-12-04T09:43:33.0366036Z * [new tag] trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 -> trunk/59abd50e931f4efb21b053f7a2911f5d8a49d883 2025-12-04T09:43:33.0367404Z * [new tag] trunk/5a607febc04c3a2b5824c75f3f60307867439a2c -> trunk/5a607febc04c3a2b5824c75f3f60307867439a2c 2025-12-04T09:43:33.0368789Z * [new tag] trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b -> trunk/5bf1cdf4755c54ef462b44cb8041b0a57311556b 2025-12-04T09:43:33.0370026Z * [new tag] trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c -> trunk/5f0030ba63d334d7e8c93a09e41403b89e4c573c 2025-12-04T09:43:33.0371415Z * [new tag] trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 -> trunk/5f21d27e71268464d362a96c9ac09ea475f7f202 2025-12-04T09:43:33.0372807Z * [new tag] trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 -> trunk/5fafc13038c9988d9ac21fa793fbd5890604b447 
2025-12-04T09:43:33.0374234Z * [new tag] trunk/61be54a31dc09b59d99b62176fb935aee0b924ef -> trunk/61be54a31dc09b59d99b62176fb935aee0b924ef 2025-12-04T09:43:33.0375599Z * [new tag] trunk/62d3ccd71484ed6a760d909b41487101bbc65719 -> trunk/62d3ccd71484ed6a760d909b41487101bbc65719 2025-12-04T09:43:33.0376966Z * [new tag] trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b -> trunk/641cdb68ae27668eb441d0e49c87a0602c120c2b 2025-12-04T09:43:33.0378324Z * [new tag] trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a -> trunk/65c4620d6bb0c6029f69762c22b91dda2294da9a 2025-12-04T09:43:33.0379691Z * [new tag] trunk/66004b993744b4106bf8afaba71f3c228a804206 -> trunk/66004b993744b4106bf8afaba71f3c228a804206 2025-12-04T09:43:33.0381101Z * [new tag] trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 -> trunk/6658a04c7ca67acb64512341342e7b3ee13ee386 2025-12-04T09:43:33.0382452Z * [new tag] trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 -> trunk/6864e309092a71f8ab0ca6a4dc7f8a4073fd31c4 2025-12-04T09:43:33.0383930Z * [new tag] trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d -> trunk/6c261c6cb07892c90ca19ed51c9705b1659a3f7d 2025-12-04T09:43:33.0385183Z * [new tag] trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b -> trunk/6c8b6a043f1628188b6396b3a2a6e000ca68362b 2025-12-04T09:43:33.0386506Z * [new tag] trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 -> trunk/6ceb4a32f92ae67ce5d7d97931d17401ebf5ffa5 2025-12-04T09:43:33.0387966Z * [new tag] trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 -> trunk/6e404e9b7d6f5fb0de86aa73888c3038248c17f8 2025-12-04T09:43:33.0389398Z * [new tag] trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec -> trunk/6ec30b490aee1db6bcdc7340abddef25784f08ec 2025-12-04T09:43:33.0390776Z * [new tag] trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 -> trunk/6f2783a6c08e1db34275ff25176ffe9aebc30a71 2025-12-04T09:43:33.0392147Z * [new tag] trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d -> trunk/6f53fefeb90ad3281119b5cfc4aa9ffd8a066e3d 2025-12-04T09:43:33.0393521Z * [new tag] 
trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a -> trunk/6f7dcf51e46d0c880db1a2f5c70de57adb576f4a 2025-12-04T09:43:33.0394922Z * [new tag] trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e -> trunk/6ff831180d2fa436c7f1c1af3adac641fce9d60e 2025-12-04T09:43:33.0396270Z * [new tag] trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 -> trunk/70076464a63ab218a7ceefb0e76ccd7131deb8f8 2025-12-04T09:43:33.0397607Z * [new tag] trunk/70d797a5fc109b20a517646fcaa819477cd0d485 -> trunk/70d797a5fc109b20a517646fcaa819477cd0d485 2025-12-04T09:43:33.0398964Z * [new tag] trunk/7348cb355ff0a6f79cd4871215aea72185748734 -> trunk/7348cb355ff0a6f79cd4871215aea72185748734 2025-12-04T09:43:33.0400373Z * [new tag] trunk/74fe26a1ebe32931783569f2e762e3c2c974901f -> trunk/74fe26a1ebe32931783569f2e762e3c2c974901f 2025-12-04T09:43:33.0401887Z * [new tag] trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 -> trunk/76aeb8c7e0f795b3fddca134cbea9a69da3ee696 2025-12-04T09:43:33.0403182Z * [new tag] trunk/7716da9fb23f27a65b41f9f016a2afadf281c18f -> trunk/7716da9fb23f27a65b41f9f016a2afadf281c18f 2025-12-04T09:43:33.0404493Z * [new tag] trunk/7741edd4ed665f3988052e260863efb508d61a03 -> trunk/7741edd4ed665f3988052e260863efb508d61a03 2025-12-04T09:43:33.0405900Z * [new tag] trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 -> trunk/78adb3b3df41b45d2368b67226d2f864b78939a6 2025-12-04T09:43:33.0407314Z * [new tag] trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 -> trunk/79d7b178225e5ed24d4e1db74e5abbff848f5fb7 2025-12-04T09:43:33.0408621Z * [new tag] trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 -> trunk/7a1e316115fc6996b3f2336822ba5d5f6179f0c3 2025-12-04T09:43:33.0409906Z * [new tag] trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca -> trunk/7a41b66367c38d0af3e8a90f7be48d6b281e7bca 2025-12-04T09:43:33.0411238Z * [new tag] trunk/7b7af390ea8541c611d1ce2018a6934188fc197b -> trunk/7b7af390ea8541c611d1ce2018a6934188fc197b 2025-12-04T09:43:33.0412597Z * [new tag] trunk/7ba4680f3755a560af81aa0f688791e367aa3609 -> 
trunk/7ba4680f3755a560af81aa0f688791e367aa3609 2025-12-04T09:43:33.0414070Z * [new tag] trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b -> trunk/7bc2a66ded06a0b2549aa51d807edc5dc3e73d1b 2025-12-04T09:43:33.0415319Z * [new tag] trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 -> trunk/7c648509a7470ace9fb2bae960dd4790f7e943e9 2025-12-04T09:43:33.0416586Z * [new tag] trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 -> trunk/7cbc2d034cecd21ab5c9707d0a9c525c17143fb8 2025-12-04T09:43:33.0417951Z * [new tag] trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed -> trunk/7d1bbaf4ba301ea3fba6f3c7bc02d58f6417aaed 2025-12-04T09:43:33.0419413Z * [new tag] trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 -> trunk/7d2a33e4ebf60b217a3cd77feae19231eb996fc8 2025-12-04T09:43:33.0420689Z * [new tag] trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e -> trunk/7eb625920054b1126a7d2d99818aaa188c6ba95e 2025-12-04T09:43:33.0421959Z * [new tag] trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead -> trunk/7f55ba19c456a3d6cc443dd9edb6bb7cca677ead 2025-12-04T09:43:33.0423324Z * [new tag] trunk/81af382128efa094d8702e18f2c133760904c718 -> trunk/81af382128efa094d8702e18f2c133760904c718 2025-12-04T09:43:33.0424939Z * [new tag] trunk/84149583d483e9c973c9a0feda70e4f3964947b0 -> trunk/84149583d483e9c973c9a0feda70e4f3964947b0 2025-12-04T09:43:33.0426435Z * [new tag] trunk/85a315917efe82c24306be805c584ec044951c75 -> trunk/85a315917efe82c24306be805c584ec044951c75 2025-12-04T09:43:33.0427909Z * [new tag] trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece -> trunk/87329491c82a5f8c1cc4ec11d8f55a5de2551ece 2025-12-04T09:43:33.0429196Z * [new tag] trunk/892640e25aeefa8007c5af837214b4502b6b62a6 -> trunk/892640e25aeefa8007c5af837214b4502b6b62a6 2025-12-04T09:43:33.0430692Z * [new tag] trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 -> trunk/89e3bbcb5b5321dc8b9520b4d5a8ee60cea1d0b4 2025-12-04T09:43:33.0432049Z * [new tag] trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c -> trunk/8c73bbbb02159223c0c97d268a0a74cb78158a1c 
2025-12-04T09:43:33.0433414Z * [new tag] trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 -> trunk/8d56e98c8db988a22cb2dfaeefb30bc7d2a3cc43 2025-12-04T09:43:33.0434888Z * [new tag] trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 -> trunk/8d9dd9603e5ee26c01007f0cd4f018e584840922 2025-12-04T09:43:33.0436262Z * [new tag] trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca -> trunk/8ef0c0b02b062d75e7c9be2594914a3e784d23ca 2025-12-04T09:43:33.0437643Z * [new tag] trunk/90b27e7e8352cde97d32ddad24740ef819633f38 -> trunk/90b27e7e8352cde97d32ddad24740ef819633f38 2025-12-04T09:43:33.0438905Z * [new tag] trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 -> trunk/90f0139e64b2951815d524b6a373bed20c4fbf90 2025-12-04T09:43:33.0440176Z * [new tag] trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c -> trunk/93d0d6838c56af59b0dba794e6aa08f0c1c7799c 2025-12-04T09:43:33.0441645Z * [new tag] trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 -> trunk/94ca8d5f1e81fea3ae488650a0fb6795049a9f87 2025-12-04T09:43:33.0442999Z * [new tag] trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 -> trunk/9844fbeadd5cebdf1281d6fbf79164139c352693 2025-12-04T09:43:33.0444797Z * [new tag] trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa -> trunk/99024dec888ec1e50b546822a32b6fb2f35e5eaa 2025-12-04T09:43:33.0446278Z * [new tag] trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d -> trunk/9a296e640fc88aa44d275b48cd9cc30c573b169d 2025-12-04T09:43:33.0447724Z * [new tag] trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 -> trunk/9b3e34d8589b29f7b4e7fab6f78711b7ca6e4639 2025-12-04T09:43:33.0449091Z * [new tag] trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 -> trunk/9cd055e547e9b67a5f9827f8999c38d7eda1bcb8 2025-12-04T09:43:33.0450483Z * [new tag] trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d -> trunk/9f0df5686cb4ada94f94620acba2e3c3f363b11d 2025-12-04T09:43:33.0451887Z * [new tag] trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a -> trunk/9f7fceb887d0cfa0326a59b887821c63ff11340a 2025-12-04T09:43:33.0453275Z * [new tag] 
trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 -> trunk/9f8ef8855d3078d70f7b782540ff2aaf158d6742 2025-12-04T09:43:33.0454772Z * [new tag] trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 -> trunk/9fb52efc797b47a1f425a03aa5e47b866d8b1098 2025-12-04T09:43:33.0456347Z * [new tag] trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa -> trunk/9ff4a2ebc5762d46c73e46b1b523d7ff349fedfa 2025-12-04T09:43:33.0457918Z * [new tag] trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d -> trunk/a0f3937b94422354538ebbd47202d5b0e8a3fd0d 2025-12-04T09:43:33.0459189Z * [new tag] trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c -> trunk/a15066c28b3145e6edbfc88359d0411d14cfc70c 2025-12-04T09:43:33.0460538Z * [new tag] trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 -> trunk/a20f775e82564d2a9979221ed7f3b8d7cf54ce90 2025-12-04T09:43:33.0461954Z * [new tag] trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c -> trunk/a2973fb00ec002dd4b6bbf07385f066efb259b8c 2025-12-04T09:43:33.0463224Z * [new tag] trunk/a7dc6dab9ad911259d4801c502907e531594db45 -> trunk/a7dc6dab9ad911259d4801c502907e531594db45 2025-12-04T09:43:33.0464658Z * [new tag] trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 -> trunk/a951a9cee65c01660bbc6e6fded90ecb10fa6109 2025-12-04T09:43:33.0466084Z * [new tag] trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e -> trunk/abfa1a6d65c7c159e35c72c25979b9da4971689e 2025-12-04T09:43:33.0467596Z * [new tag] trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e -> trunk/ae3a2395bf66151078e2d201716f7d63ce1c6f3e 2025-12-04T09:43:33.0468952Z * [new tag] trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e -> trunk/afdff7f0325080dedac44d080cb5a3b0e65e6c5e 2025-12-04T09:43:33.0470213Z * [new tag] trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 -> trunk/b1aed4e7a72c03a38f44543aaea0dae2e9b76d48 2025-12-04T09:43:33.0471589Z * [new tag] trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 -> trunk/b1decff555cd50e2123c8c6e25cc0d447c411f62 2025-12-04T09:43:33.0473045Z * [new tag] trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 -> 
trunk/b2b6b034c9fd08672c40e63ef243556ad4c49bd2 2025-12-04T09:43:33.0474455Z * [new tag] trunk/b39813b4a04931682b0491adba2138d01d716d99 -> trunk/b39813b4a04931682b0491adba2138d01d716d99 2025-12-04T09:43:33.0475872Z * [new tag] trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 -> trunk/b3a7edb2311367974cc7cd764cfb11a5d6758b24 2025-12-04T09:43:33.0477332Z * [new tag] trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 -> trunk/b4cc1329c86acaef6d42c1fac7169b8d870ab0d7 2025-12-04T09:43:33.0478775Z * [new tag] trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a -> trunk/b555c39217f765759954a4f9f9bd1e9b87bed11a 2025-12-04T09:43:33.0480182Z * [new tag] trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 -> trunk/b6b6c80379388b7f9932c3e6a0f9907bf430e417 2025-12-04T09:43:33.0481658Z * [new tag] trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 -> trunk/b6b6d912df0b6f4082f8e50b18bd1de1dd7325f4 2025-12-04T09:43:33.0483120Z * [new tag] trunk/b7d60685f8cbc939b68a20871e90db67e729329b -> trunk/b7d60685f8cbc939b68a20871e90db67e729329b 2025-12-04T09:43:33.0484523Z * [new tag] trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e -> trunk/b7f6b9a4fc6259f7af068f31868b3119bb1bac3e 2025-12-04T09:43:33.0486000Z * [new tag] trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf -> trunk/b8c4ba3593761e7b2a3ebd86f040fb07b47c02cf 2025-12-04T09:43:33.0487326Z * [new tag] trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 -> trunk/b9c8f3a4884befb965ff42620ce44a71b04887f5 2025-12-04T09:43:33.0488788Z * [new tag] trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f -> trunk/ba1412546f3082c0958c077acc2025e4dbc33f1f 2025-12-04T09:43:33.0490318Z * [new tag] trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f -> trunk/bac403c0b38c63bdbcc0c31f1c2b0bc0260f610f 2025-12-04T09:43:33.0491803Z * [new tag] trunk/bb3034198b459401fabeab254e1b99f0115046e2 -> trunk/bb3034198b459401fabeab254e1b99f0115046e2 2025-12-04T09:43:33.0493206Z * [new tag] trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 -> trunk/bc39b2b3bc7a6e19a42e62bd576974035086fe55 
2025-12-04T09:43:33.0494796Z * [new tag] trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 -> trunk/bc43d5b297f207a11d83d77ddf0152bdaabe15a8 2025-12-04T09:43:33.0496186Z * [new tag] trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 -> trunk/bc6a4863c7246a6493d16d4ea6eee71ec07c6a09 2025-12-04T09:43:33.0497603Z * [new tag] trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 -> trunk/bea4912944defdbcb8b061800caab6cbbbd01df5 2025-12-04T09:43:33.0499257Z * [new tag] trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 -> trunk/c04e2c656f48d82d1521b867bbbf03967b9b7564 2025-12-04T09:43:33.0500694Z * [new tag] trunk/c0660bcee27e7d7731634e274576a7081882bede -> trunk/c0660bcee27e7d7731634e274576a7081882bede 2025-12-04T09:43:33.0502104Z * [new tag] trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac -> trunk/c178ed43d3d99cbefe84fbfb21d6f282b20d62ac 2025-12-04T09:43:33.0503511Z * [new tag] trunk/c55b1e8f61d041ee436d697449eb028931d574fb -> trunk/c55b1e8f61d041ee436d697449eb028931d574fb 2025-12-04T09:43:33.0504859Z * [new tag] trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 -> trunk/c6ae7579fe12fe75f1a8f7043a494c90567273f1 2025-12-04T09:43:33.0506417Z * [new tag] trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 -> trunk/c8210e7d94bad5ae21ac389fa4ba8a463c76c4d0 2025-12-04T09:43:33.0507943Z * [new tag] trunk/cc0853af42122f8185321f542616f4474e717f09 -> trunk/cc0853af42122f8185321f542616f4474e717f09 2025-12-04T09:43:33.0509257Z * [new tag] trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 -> trunk/cddec6562eabfa390d014fa3741a5659cf9c94c9 2025-12-04T09:43:33.0510751Z * [new tag] trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a -> trunk/ce5e7e3bf1f4b69a4f4f93d288ba75b906df492a 2025-12-04T09:43:33.0512175Z * [new tag] trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace -> trunk/d038b0130ec7c20ebcac219301292fd8e98a1ace 2025-12-04T09:43:33.0513498Z * [new tag] trunk/d16447dacaf2420ea175f0c275c75da951f57d39 -> trunk/d16447dacaf2420ea175f0c275c75da951f57d39 2025-12-04T09:43:33.0514994Z * [new tag] 
trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 -> trunk/d19f1e8cab6810bb2e99141f9976665954c67a50 2025-12-04T09:43:33.0516384Z * [new tag] trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 -> trunk/d1c9f03b2a5af4104721712f8cdffe9b4f340c01 2025-12-04T09:43:33.0517857Z * [new tag] trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf -> trunk/d40f4950f2b7f7aa380a22fe0f6166e71680fbcf 2025-12-04T09:43:33.0519310Z * [new tag] trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 -> trunk/d5038950bacfe36bbf24a47a455fe76901deb8e8 2025-12-04T09:43:33.0520663Z * [new tag] trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d -> trunk/d54ff42903c2ae0533931ff11d23b35f875bdb3d 2025-12-04T09:43:33.0522087Z * [new tag] trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 -> trunk/d76697633a2d2b9cced1ae21161849b33bfe7e47 2025-12-04T09:43:33.0523512Z * [new tag] trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 -> trunk/d78f52b199c547106d4cd9d2856dd0805c118bf1 2025-12-04T09:43:33.0524911Z * [new tag] trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e -> trunk/d8fd5c6eed28e5004150691d048a3f6785e19a8e 2025-12-04T09:43:33.0526298Z * [new tag] trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a -> trunk/d900f5e86745dec76713f4b0ef07005ef36b2f5a 2025-12-04T09:43:33.0527713Z * [new tag] trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b -> trunk/d973dc6b87d763859fe1c5bd1287e3b6b1c49d1b 2025-12-04T09:43:33.0529163Z * [new tag] trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec -> trunk/d998c03304cb6ede76e1ed535b4ddeb6c2bf40ec 2025-12-04T09:43:33.0530585Z * [new tag] trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf -> trunk/d9cb8a70833101dbbe16b99520cfbdd70d0a87bf 2025-12-04T09:43:33.0532110Z * [new tag] trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd -> trunk/d9d5e91b43f70eb8637af55db6856d49be391ffd 2025-12-04T09:43:33.0533437Z * [new tag] trunk/dd18a75336a4fbd7497955cc5665904724fce889 -> trunk/dd18a75336a4fbd7497955cc5665904724fce889 2025-12-04T09:43:33.0534826Z * [new tag] trunk/ded9bcd61a059bf723e6e84689552962b480ea77 -> 
trunk/ded9bcd61a059bf723e6e84689552962b480ea77 2025-12-04T09:43:33.0536897Z * [new tag] trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c -> trunk/dfbd3714d15c37a7b83b322a6b60f997fc00f50c 2025-12-04T09:43:33.0538564Z * [new tag] trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b -> trunk/e115f9f4e4b039f8e9a642aaa2bd8254a920541b 2025-12-04T09:43:33.0539843Z * [new tag] trunk/e3f24fd73ad74c6e7176687986436956c7c18235 -> trunk/e3f24fd73ad74c6e7176687986436956c7c18235 2025-12-04T09:43:33.0541216Z * [new tag] trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e -> trunk/e7d24d3ff93d1503ba63860b7057438ad93f918e 2025-12-04T09:43:33.0542697Z * [new tag] trunk/ea7035f462a0d2830865ee86c832bd101e1427fc -> trunk/ea7035f462a0d2830865ee86c832bd101e1427fc 2025-12-04T09:43:33.0544192Z * [new tag] trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 -> trunk/eabb7ad2128580ef674446027b95bcf4e21e8df3 2025-12-04T09:43:33.0545613Z * [new tag] trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf -> trunk/eb5c63652a33da42e7018c23df5f20a3eb4c6ccf 2025-12-04T09:43:33.0547021Z * [new tag] trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e -> trunk/ec2c71f5c85021b8938cdafadce24c15a36fd93e 2025-12-04T09:43:33.0548528Z * [new tag] trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e -> trunk/ecbcc3f6bf327856b435b259ac63cc2f328c4b4e 2025-12-04T09:43:33.0550265Z * [new tag] trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 -> trunk/ee87bbe876c42575e961b32a0827d76bc9782ca2 2025-12-04T09:43:33.0551662Z * [new tag] trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 -> trunk/ef019d1d431c4c5a95b594cb90d40a50cd00f5e4 2025-12-04T09:43:33.0553084Z * [new tag] trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 -> trunk/ef8ecc13830a86c4b231f1aad9aba7851db61b53 2025-12-04T09:43:33.0554475Z * [new tag] trunk/f1076f5510920044912247b1abb8760cb820f598 -> trunk/f1076f5510920044912247b1abb8760cb820f598 2025-12-04T09:43:33.0556092Z * [new tag] trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 -> trunk/f2d6a75a00a1d648ca9a0abc6a33e14c3dea6c40 
2025-12-04T09:43:33.0559199Z * [new tag] trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 -> trunk/f47dd0ddef1359e5b43e4b962412f67b30ecde56 2025-12-04T09:43:33.0560731Z * [new tag] trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 -> trunk/f49d32dfa4730dcfb1b60eeeb369b5889da983c8 2025-12-04T09:43:33.0562118Z * [new tag] trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 -> trunk/f4dedf78fc30fd4b93975787ca6074ee89db9467 2025-12-04T09:43:33.0563578Z * [new tag] trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 -> trunk/f7c0d03819ebed05c4038f095d66d1b8c54aca17 2025-12-04T09:43:33.0565109Z * [new tag] trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 -> trunk/f7e1bd80a063e17453c361837ba6ea2570920a73 2025-12-04T09:43:33.0566450Z * [new tag] trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 -> trunk/f9bd6c53624c7c0ea3772de78498326e84c2f0e7 2025-12-04T09:43:33.0567942Z * [new tag] trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b -> trunk/fb5be221a46b51bfc9509013b0d85bc5a9d4f15b 2025-12-04T09:43:33.0569356Z * [new tag] trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 -> trunk/fdf863d5e1de3b2688c9511e96876e34581dbfd7 2025-12-04T09:43:33.0571158Z * [new tag] trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 -> trunk/fe0e65adfc0e7ca6e5f57e6ea8b16bd5cc967307 2025-12-04T09:43:33.0572600Z * [new tag] trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 -> trunk/fec710bf89173f5355468a7ce1afe9157c3d9009 2025-12-04T09:43:33.0574274Z * [new tag] trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 -> trunk/ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:43:33.0575305Z * [new tag] v0.1.1 -> v0.1.1 2025-12-04T09:43:33.0576696Z * [new tag] v0.1.10 -> v0.1.10 2025-12-04T09:43:33.0577997Z * [new tag] v0.1.11 -> v0.1.11 2025-12-04T09:43:33.0579373Z * [new tag] v0.1.12 -> v0.1.12 2025-12-04T09:43:33.0580719Z * [new tag] v0.1.2 -> v0.1.2 2025-12-04T09:43:33.0581987Z * [new tag] v0.1.3 -> v0.1.3 2025-12-04T09:43:33.0583265Z * [new tag] v0.1.4 -> v0.1.4 2025-12-04T09:43:33.0584602Z * [new tag] v0.1.5 -> v0.1.5 
2025-12-04T09:43:33.0585954Z * [new tag] v0.1.6 -> v0.1.6 2025-12-04T09:43:33.0587791Z * [new tag] v0.1.7 -> v0.1.7 2025-12-04T09:43:33.0589249Z * [new tag] v0.1.8 -> v0.1.8 2025-12-04T09:43:33.0590631Z * [new tag] v0.1.9 -> v0.1.9 2025-12-04T09:43:33.0592001Z * [new tag] v0.2.0 -> v0.2.0 2025-12-04T09:43:33.0593344Z * [new tag] v0.3.0 -> v0.3.0 2025-12-04T09:43:33.0594759Z * [new tag] v0.3.1 -> v0.3.1 2025-12-04T09:43:33.0596154Z * [new tag] v0.4.0 -> v0.4.0 2025-12-04T09:43:33.0597444Z * [new tag] v0.4.1 -> v0.4.1 2025-12-04T09:43:33.0598781Z * [new tag] v1.0.0 -> v1.0.0 2025-12-04T09:43:33.0600141Z * [new tag] v1.0.0a0 -> v1.0.0a0 2025-12-04T09:43:33.0601515Z * [new tag] v1.0.1 -> v1.0.1 2025-12-04T09:43:33.0602860Z * [new tag] v1.0rc0 -> v1.0rc0 2025-12-04T09:43:33.0604062Z * [new tag] v1.0rc1 -> v1.0rc1 2025-12-04T09:43:33.0605381Z * [new tag] v1.1.0 -> v1.1.0 2025-12-04T09:43:33.0606765Z * [new tag] v1.1.0a0 -> v1.1.0a0 2025-12-04T09:43:33.0608237Z * [new tag] v1.10.0 -> v1.10.0 2025-12-04T09:43:33.0609732Z * [new tag] v1.10.0-rc1 -> v1.10.0-rc1 2025-12-04T09:43:33.0611092Z * [new tag] v1.10.0-rc2 -> v1.10.0-rc2 2025-12-04T09:43:33.0612323Z * [new tag] v1.10.0-rc3 -> v1.10.0-rc3 2025-12-04T09:43:33.0613757Z * [new tag] v1.10.1 -> v1.10.1 2025-12-04T09:43:33.0615133Z * [new tag] v1.10.1-rc1 -> v1.10.1-rc1 2025-12-04T09:43:33.0616169Z * [new tag] v1.10.2 -> v1.10.2 2025-12-04T09:43:33.0617426Z * [new tag] v1.10.2-rc1 -> v1.10.2-rc1 2025-12-04T09:43:33.0618807Z * [new tag] v1.11.0 -> v1.11.0 2025-12-04T09:43:33.0620173Z * [new tag] v1.11.0-rc1 -> v1.11.0-rc1 2025-12-04T09:43:33.0621671Z * [new tag] v1.11.0-rc2 -> v1.11.0-rc2 2025-12-04T09:43:33.0623041Z * [new tag] v1.11.0-rc3 -> v1.11.0-rc3 2025-12-04T09:43:33.0624449Z * [new tag] v1.11.0-rc4 -> v1.11.0-rc4 2025-12-04T09:43:33.0625827Z * [new tag] v1.11.0-rc5 -> v1.11.0-rc5 2025-12-04T09:43:33.0626952Z * [new tag] v1.11.0-rc6 -> v1.11.0-rc6 2025-12-04T09:43:33.0628194Z * [new tag] v1.11.0-rc7 -> v1.11.0-rc7 
2025-12-04T09:43:33.0629839Z * [new tag] v1.12.0 -> v1.12.0 2025-12-04T09:43:33.0631069Z * [new tag] v1.12.0-rc1 -> v1.12.0-rc1 2025-12-04T09:43:33.0632521Z * [new tag] v1.12.0-rc2 -> v1.12.0-rc2 2025-12-04T09:43:33.0633898Z * [new tag] v1.12.0-rc3 -> v1.12.0-rc3 2025-12-04T09:43:33.0635275Z * [new tag] v1.12.0-rc4 -> v1.12.0-rc4 2025-12-04T09:43:33.0636603Z * [new tag] v1.12.0-rc5 -> v1.12.0-rc5 2025-12-04T09:43:33.0638233Z * [new tag] v1.12.0-rc6 -> v1.12.0-rc6 2025-12-04T09:43:33.0639185Z * [new tag] v1.12.0-rc7 -> v1.12.0-rc7 2025-12-04T09:43:33.0640514Z * [new tag] v1.12.0-rc8 -> v1.12.0-rc8 2025-12-04T09:43:33.0641523Z * [new tag] v1.12.1 -> v1.12.1 2025-12-04T09:43:33.0643196Z * [new tag] v1.12.1-rc1 -> v1.12.1-rc1 2025-12-04T09:43:33.0644474Z * [new tag] v1.12.1-rc2 -> v1.12.1-rc2 2025-12-04T09:43:33.0645927Z * [new tag] v1.12.1-rc3 -> v1.12.1-rc3 2025-12-04T09:43:33.0647248Z * [new tag] v1.12.1-rc4 -> v1.12.1-rc4 2025-12-04T09:43:33.0648427Z * [new tag] v1.12.1-rc5 -> v1.12.1-rc5 2025-12-04T09:43:33.0649828Z * [new tag] v1.13.0 -> v1.13.0 2025-12-04T09:43:33.0651117Z * [new tag] v1.13.0-rc1 -> v1.13.0-rc1 2025-12-04T09:43:33.0652547Z * [new tag] v1.13.0-rc2 -> v1.13.0-rc2 2025-12-04T09:43:33.0653918Z * [new tag] v1.13.0-rc3 -> v1.13.0-rc3 2025-12-04T09:43:33.0655481Z * [new tag] v1.13.0-rc4 -> v1.13.0-rc4 2025-12-04T09:43:33.0656819Z * [new tag] v1.13.0-rc5 -> v1.13.0-rc5 2025-12-04T09:43:33.0657820Z * [new tag] v1.13.0-rc6 -> v1.13.0-rc6 2025-12-04T09:43:33.0659418Z * [new tag] v1.13.1 -> v1.13.1 2025-12-04T09:43:33.0660429Z * [new tag] v1.13.1-rc1 -> v1.13.1-rc1 2025-12-04T09:43:33.0661896Z * [new tag] v1.2.0 -> v1.2.0 2025-12-04T09:43:33.0663265Z * [new tag] v1.2.0a0 -> v1.2.0a0 2025-12-04T09:43:33.0664649Z * [new tag] v1.3.0 -> v1.3.0 2025-12-04T09:43:33.0665937Z * [new tag] v1.3.0a0 -> v1.3.0a0 2025-12-04T09:43:33.0666939Z * [new tag] v1.3.1 -> v1.3.1 2025-12-04T09:43:33.0668506Z * [new tag] v1.4.0 -> v1.4.0 2025-12-04T09:43:33.0669964Z * [new tag] 
v1.4.0a0 -> v1.4.0a0 2025-12-04T09:43:33.0670916Z * [new tag] v1.4.1 -> v1.4.1 2025-12-04T09:43:33.0672435Z * [new tag] v1.5.0 -> v1.5.0 2025-12-04T09:43:33.0674565Z * [new tag] v1.5.0-rc1 -> v1.5.0-rc1 2025-12-04T09:43:33.0676041Z * [new tag] v1.5.0-rc2 -> v1.5.0-rc2 2025-12-04T09:43:33.0677452Z * [new tag] v1.5.0-rc3 -> v1.5.0-rc3 2025-12-04T09:43:33.0678720Z * [new tag] v1.5.0-rc4 -> v1.5.0-rc4 2025-12-04T09:43:33.0679722Z * [new tag] v1.5.0-rc5 -> v1.5.0-rc5 2025-12-04T09:43:33.0681294Z * [new tag] v1.5.1 -> v1.5.1 2025-12-04T09:43:33.0682302Z * [new tag] v1.5.1-rc1 -> v1.5.1-rc1 2025-12-04T09:43:33.0683611Z * [new tag] v1.6.0 -> v1.6.0 2025-12-04T09:43:33.0684966Z * [new tag] v1.6.0-rc1 -> v1.6.0-rc1 2025-12-04T09:43:33.0686554Z * [new tag] v1.6.0-rc2 -> v1.6.0-rc2 2025-12-04T09:43:33.0687638Z * [new tag] v1.6.0-rc3 -> v1.6.0-rc3 2025-12-04T09:43:33.0689073Z * [new tag] v1.6.0-rc4 -> v1.6.0-rc4 2025-12-04T09:43:33.0690635Z * [new tag] v1.6.0-rc5 -> v1.6.0-rc5 2025-12-04T09:43:33.0692162Z * [new tag] v1.6.0-rc6 -> v1.6.0-rc6 2025-12-04T09:43:33.0693375Z * [new tag] v1.6.0-rc7 -> v1.6.0-rc7 2025-12-04T09:43:33.0694742Z * [new tag] v1.7.0 -> v1.7.0 2025-12-04T09:43:33.0696187Z * [new tag] v1.7.0-rc1 -> v1.7.0-rc1 2025-12-04T09:43:33.0697748Z * [new tag] v1.7.0-rc2 -> v1.7.0-rc2 2025-12-04T09:43:33.0699041Z * [new tag] v1.7.0-rc3 -> v1.7.0-rc3 2025-12-04T09:43:33.0700338Z * [new tag] v1.7.0-rc4 -> v1.7.0-rc4 2025-12-04T09:43:33.0701725Z * [new tag] v1.7.1 -> v1.7.1 2025-12-04T09:43:33.0703249Z * [new tag] v1.7.1-rc1 -> v1.7.1-rc1 2025-12-04T09:43:33.0704586Z * [new tag] v1.7.1-rc2 -> v1.7.1-rc2 2025-12-04T09:43:33.0705601Z * [new tag] v1.7.1-rc3 -> v1.7.1-rc3 2025-12-04T09:43:33.0707105Z * [new tag] v1.8.0 -> v1.8.0 2025-12-04T09:43:33.0708476Z * [new tag] v1.8.0-rc1 -> v1.8.0-rc1 2025-12-04T09:43:33.0709775Z * [new tag] v1.8.0-rc2 -> v1.8.0-rc2 2025-12-04T09:43:33.0711165Z * [new tag] v1.8.0-rc3 -> v1.8.0-rc3 2025-12-04T09:43:33.0712461Z * [new tag] v1.8.0-rc4 -> 
v1.8.0-rc4 2025-12-04T09:43:33.0713510Z * [new tag] v1.8.0-rc5 -> v1.8.0-rc5 2025-12-04T09:43:33.0714707Z * [new tag] v1.8.1 -> v1.8.1 2025-12-04T09:43:33.0716119Z * [new tag] v1.8.1-rc1 -> v1.8.1-rc1 2025-12-04T09:43:33.0717144Z * [new tag] v1.8.1-rc2 -> v1.8.1-rc2 2025-12-04T09:43:33.0718519Z * [new tag] v1.8.1-rc3 -> v1.8.1-rc3 2025-12-04T09:43:33.0720197Z * [new tag] v1.8.2 -> v1.8.2 2025-12-04T09:43:33.0721421Z * [new tag] v1.8.2-rc1 -> v1.8.2-rc1 2025-12-04T09:43:33.0722718Z * [new tag] v1.9.0 -> v1.9.0 2025-12-04T09:43:33.0724151Z * [new tag] v1.9.0-rc1 -> v1.9.0-rc1 2025-12-04T09:43:33.0725508Z * [new tag] v1.9.0-rc2 -> v1.9.0-rc2 2025-12-04T09:43:33.0726935Z * [new tag] v1.9.0-rc3 -> v1.9.0-rc3 2025-12-04T09:43:33.0728052Z * [new tag] v1.9.0-rc4 -> v1.9.0-rc4 2025-12-04T09:43:33.0729356Z * [new tag] v1.9.1 -> v1.9.1 2025-12-04T09:43:33.0730837Z * [new tag] v1.9.1-rc1 -> v1.9.1-rc1 2025-12-04T09:43:33.0732066Z * [new tag] v1.9.1-rc2 -> v1.9.1-rc2 2025-12-04T09:43:33.0733406Z * [new tag] v2.0.0 -> v2.0.0 2025-12-04T09:43:33.0734742Z * [new tag] v2.0.0-rc1 -> v2.0.0-rc1 2025-12-04T09:43:33.0736088Z * [new tag] v2.0.0-rc2 -> v2.0.0-rc2 2025-12-04T09:43:33.0737510Z * [new tag] v2.0.0-rc3 -> v2.0.0-rc3 2025-12-04T09:43:33.0738896Z * [new tag] v2.0.0-rc4 -> v2.0.0-rc4 2025-12-04T09:43:33.0740292Z * [new tag] v2.0.0-rc5 -> v2.0.0-rc5 2025-12-04T09:43:33.0741732Z * [new tag] v2.0.0-rc6 -> v2.0.0-rc6 2025-12-04T09:43:33.0743077Z * [new tag] v2.0.1 -> v2.0.1 2025-12-04T09:43:33.0744420Z * [new tag] v2.0.1-rc1 -> v2.0.1-rc1 2025-12-04T09:43:33.0745452Z * [new tag] v2.0.1-rc2 -> v2.0.1-rc2 2025-12-04T09:43:33.0746851Z * [new tag] v2.0.1-rc3 -> v2.0.1-rc3 2025-12-04T09:43:33.0747995Z * [new tag] v2.0.1-rc4 -> v2.0.1-rc4 2025-12-04T09:43:33.0749893Z * [new tag] v2.1.0 -> v2.1.0 2025-12-04T09:43:33.0751296Z * [new tag] v2.1.0-rc1 -> v2.1.0-rc1 2025-12-04T09:43:33.0752654Z * [new tag] v2.1.0-rc2 -> v2.1.0-rc2 2025-12-04T09:43:33.0754064Z * [new tag] v2.1.0-rc3 -> 
v2.1.0-rc3 2025-12-04T09:43:33.0755599Z * [new tag] v2.1.0-rc4 -> v2.1.0-rc4 2025-12-04T09:43:33.0757136Z * [new tag] v2.1.0-rc5 -> v2.1.0-rc5 2025-12-04T09:43:33.0758154Z * [new tag] v2.1.0-rc6 -> v2.1.0-rc6 2025-12-04T09:43:33.0759724Z * [new tag] v2.1.1 -> v2.1.1 2025-12-04T09:43:33.0761616Z * [new tag] v2.1.1-rc1 -> v2.1.1-rc1 2025-12-04T09:43:33.0763075Z * [new tag] v2.1.1-rc2 -> v2.1.1-rc2 2025-12-04T09:43:33.0764485Z * [new tag] v2.1.1-rc3 -> v2.1.1-rc3 2025-12-04T09:43:33.0765896Z * [new tag] v2.1.1-rc4 -> v2.1.1-rc4 2025-12-04T09:43:33.0767217Z * [new tag] v2.1.1-rc5 -> v2.1.1-rc5 2025-12-04T09:43:33.0768331Z * [new tag] v2.1.1-rc6 -> v2.1.1-rc6 2025-12-04T09:43:33.0769843Z * [new tag] v2.1.2 -> v2.1.2 2025-12-04T09:43:33.0771284Z * [new tag] v2.1.2-rc1 -> v2.1.2-rc1 2025-12-04T09:43:33.0772652Z * [new tag] v2.1.2-rc2 -> v2.1.2-rc2 2025-12-04T09:43:33.0773898Z * [new tag] v2.1.2-rc3 -> v2.1.2-rc3 2025-12-04T09:43:33.0775228Z * [new tag] v2.2.0 -> v2.2.0 2025-12-04T09:43:33.0776533Z * [new tag] v2.2.0-rc1 -> v2.2.0-rc1 2025-12-04T09:43:33.0777838Z * [new tag] v2.2.0-rc2 -> v2.2.0-rc2 2025-12-04T09:43:33.0779209Z * [new tag] v2.2.0-rc3 -> v2.2.0-rc3 2025-12-04T09:43:33.0780591Z * [new tag] v2.2.0-rc4 -> v2.2.0-rc4 2025-12-04T09:43:33.0781876Z * [new tag] v2.2.0-rc5 -> v2.2.0-rc5 2025-12-04T09:43:33.0783285Z * [new tag] v2.2.0-rc6 -> v2.2.0-rc6 2025-12-04T09:43:33.0784551Z * [new tag] v2.2.0-rc7 -> v2.2.0-rc7 2025-12-04T09:43:33.0785497Z * [new tag] v2.2.0-rc8 -> v2.2.0-rc8 2025-12-04T09:43:33.0787002Z * [new tag] v2.2.1 -> v2.2.1 2025-12-04T09:43:33.0788512Z * [new tag] v2.2.1-rc1 -> v2.2.1-rc1 2025-12-04T09:43:33.0789651Z * [new tag] v2.2.1-rc2 -> v2.2.1-rc2 2025-12-04T09:43:33.0790685Z * [new tag] v2.2.1-rc3 -> v2.2.1-rc3 2025-12-04T09:43:33.0791906Z * [new tag] v2.2.2 -> v2.2.2 2025-12-04T09:43:33.0793353Z * [new tag] v2.2.2-rc1 -> v2.2.2-rc1 2025-12-04T09:43:33.0794561Z * [new tag] v2.2.2-rc2 -> v2.2.2-rc2 2025-12-04T09:43:33.0795571Z * [new tag] 
v2.2.2-rc3 -> v2.2.2-rc3 2025-12-04T09:43:33.0797227Z * [new tag] v2.3.0 -> v2.3.0 2025-12-04T09:43:33.0798467Z * [new tag] v2.3.0-rc1 -> v2.3.0-rc1 2025-12-04T09:43:33.0799872Z * [new tag] v2.3.0-rc10 -> v2.3.0-rc10 2025-12-04T09:43:33.0801207Z * [new tag] v2.3.0-rc11 -> v2.3.0-rc11 2025-12-04T09:43:33.0802374Z * [new tag] v2.3.0-rc12 -> v2.3.0-rc12 2025-12-04T09:43:33.0803838Z * [new tag] v2.3.0-rc2 -> v2.3.0-rc2 2025-12-04T09:43:33.0805224Z * [new tag] v2.3.0-rc3 -> v2.3.0-rc3 2025-12-04T09:43:33.0806547Z * [new tag] v2.3.0-rc4 -> v2.3.0-rc4 2025-12-04T09:43:33.0807878Z * [new tag] v2.3.0-rc5 -> v2.3.0-rc5 2025-12-04T09:43:33.0809120Z * [new tag] v2.3.0-rc6 -> v2.3.0-rc6 2025-12-04T09:43:33.0810569Z * [new tag] v2.3.0-rc7 -> v2.3.0-rc7 2025-12-04T09:43:33.0811833Z * [new tag] v2.3.0-rc8 -> v2.3.0-rc8 2025-12-04T09:43:33.0812935Z * [new tag] v2.3.0-rc9 -> v2.3.0-rc9 2025-12-04T09:43:33.0814166Z * [new tag] v2.3.1 -> v2.3.1 2025-12-04T09:43:33.0815590Z * [new tag] v2.3.1-rc1 -> v2.3.1-rc1 2025-12-04T09:43:33.0816913Z * [new tag] v2.3.1-rc2 -> v2.3.1-rc2 2025-12-04T09:43:33.0818316Z * [new tag] v2.3.1-rc3 -> v2.3.1-rc3 2025-12-04T09:43:33.0819672Z * [new tag] v2.4.0 -> v2.4.0 2025-12-04T09:43:33.0821072Z * [new tag] v2.4.0-rc1 -> v2.4.0-rc1 2025-12-04T09:43:33.0822310Z * [new tag] v2.4.0-rc2 -> v2.4.0-rc2 2025-12-04T09:43:33.0823643Z * [new tag] v2.4.0-rc3 -> v2.4.0-rc3 2025-12-04T09:43:33.0825045Z * [new tag] v2.4.0-rc4 -> v2.4.0-rc4 2025-12-04T09:43:33.0826460Z * [new tag] v2.4.0-rc5 -> v2.4.0-rc5 2025-12-04T09:43:33.0827961Z * [new tag] v2.4.0-rc6 -> v2.4.0-rc6 2025-12-04T09:43:33.0829364Z * [new tag] v2.4.0-rc7 -> v2.4.0-rc7 2025-12-04T09:43:33.0830709Z * [new tag] v2.4.0-rc8 -> v2.4.0-rc8 2025-12-04T09:43:33.0832140Z * [new tag] v2.4.0-rc9 -> v2.4.0-rc9 2025-12-04T09:43:33.0833139Z * [new tag] v2.4.1 -> v2.4.1 2025-12-04T09:43:33.0834653Z * [new tag] v2.4.1-rc1 -> v2.4.1-rc1 2025-12-04T09:43:33.0836024Z * [new tag] v2.4.1-rc2 -> v2.4.1-rc2 
2025-12-04T09:43:33.0837465Z * [new tag] v2.4.1-rc3 -> v2.4.1-rc3 2025-12-04T09:43:33.0838749Z * [new tag] v2.5.0 -> v2.5.0 2025-12-04T09:43:33.0840107Z * [new tag] v2.5.0-rc1 -> v2.5.0-rc1 2025-12-04T09:43:33.0841136Z * [new tag] v2.5.0-rc10 -> v2.5.0-rc10 2025-12-04T09:43:33.0842633Z * [new tag] v2.5.0-rc2 -> v2.5.0-rc2 2025-12-04T09:43:33.0843902Z * [new tag] v2.5.0-rc3 -> v2.5.0-rc3 2025-12-04T09:43:33.0845277Z * [new tag] v2.5.0-rc4 -> v2.5.0-rc4 2025-12-04T09:43:33.0846965Z * [new tag] v2.5.0-rc5 -> v2.5.0-rc5 2025-12-04T09:43:33.0848440Z * [new tag] v2.5.0-rc6 -> v2.5.0-rc6 2025-12-04T09:43:33.0849766Z * [new tag] v2.5.0-rc7 -> v2.5.0-rc7 2025-12-04T09:43:33.0851089Z * [new tag] v2.5.0-rc8 -> v2.5.0-rc8 2025-12-04T09:43:33.0852562Z * [new tag] v2.5.0-rc9 -> v2.5.0-rc9 2025-12-04T09:43:33.0853577Z * [new tag] v2.5.1 -> v2.5.1 2025-12-04T09:43:33.0854737Z * [new tag] v2.5.1-rc1 -> v2.5.1-rc1 2025-12-04T09:43:33.0855833Z * [new tag] v2.6.0 -> v2.6.0 2025-12-04T09:43:33.0857551Z * [new tag] v2.6.0-rc1 -> v2.6.0-rc1 2025-12-04T09:43:33.0859042Z * [new tag] v2.6.0-rc2 -> v2.6.0-rc2 2025-12-04T09:43:33.0860420Z * [new tag] v2.6.0-rc3 -> v2.6.0-rc3 2025-12-04T09:43:33.0861738Z * [new tag] v2.6.0-rc4 -> v2.6.0-rc4 2025-12-04T09:43:33.0863255Z * [new tag] v2.6.0-rc5 -> v2.6.0-rc5 2025-12-04T09:43:33.0864786Z * [new tag] v2.6.0-rc6 -> v2.6.0-rc6 2025-12-04T09:43:33.0866089Z * [new tag] v2.6.0-rc7 -> v2.6.0-rc7 2025-12-04T09:43:33.0867651Z * [new tag] v2.6.0-rc8 -> v2.6.0-rc8 2025-12-04T09:43:33.0869125Z * [new tag] v2.6.0-rc9 -> v2.6.0-rc9 2025-12-04T09:43:33.0870661Z * [new tag] v2.7.0 -> v2.7.0 2025-12-04T09:43:33.0871971Z * [new tag] v2.7.0-rc1 -> v2.7.0-rc1 2025-12-04T09:43:33.0873127Z * [new tag] v2.7.0-rc10 -> v2.7.0-rc10 2025-12-04T09:43:33.0874629Z * [new tag] v2.7.0-rc2 -> v2.7.0-rc2 2025-12-04T09:43:33.0876080Z * [new tag] v2.7.0-rc3 -> v2.7.0-rc3 2025-12-04T09:43:33.0877427Z * [new tag] v2.7.0-rc4 -> v2.7.0-rc4 2025-12-04T09:43:33.0878732Z * [new tag] 
v2.7.0-rc5 -> v2.7.0-rc5 2025-12-04T09:43:33.0880105Z * [new tag] v2.7.0-rc6 -> v2.7.0-rc6 2025-12-04T09:43:33.0881499Z * [new tag] v2.7.0-rc7 -> v2.7.0-rc7 2025-12-04T09:43:33.0882828Z * [new tag] v2.7.0-rc8 -> v2.7.0-rc8 2025-12-04T09:43:33.0884231Z * [new tag] v2.7.0-rc9 -> v2.7.0-rc9 2025-12-04T09:43:33.0885365Z * [new tag] v2.7.1 -> v2.7.1 2025-12-04T09:43:33.0886902Z * [new tag] v2.7.1-rc1 -> v2.7.1-rc1 2025-12-04T09:43:33.0888340Z * [new tag] v2.7.1-rc2 -> v2.7.1-rc2 2025-12-04T09:43:33.0889773Z * [new tag] v2.7.1-rc3 -> v2.7.1-rc3 2025-12-04T09:43:33.0891224Z * [new tag] v2.7.1-rc4 -> v2.7.1-rc4 2025-12-04T09:43:33.0892661Z * [new tag] v2.7.1-rc5 -> v2.7.1-rc5 2025-12-04T09:43:33.0893712Z * [new tag] v2.8.0 -> v2.8.0 2025-12-04T09:43:33.0895233Z * [new tag] v2.8.0-rc1 -> v2.8.0-rc1 2025-12-04T09:43:33.0896612Z * [new tag] v2.8.0-rc2 -> v2.8.0-rc2 2025-12-04T09:43:33.0898151Z * [new tag] v2.8.0-rc3 -> v2.8.0-rc3 2025-12-04T09:43:33.0899541Z * [new tag] v2.8.0-rc4 -> v2.8.0-rc4 2025-12-04T09:43:33.0900904Z * [new tag] v2.8.0-rc5 -> v2.8.0-rc5 2025-12-04T09:43:33.0902349Z * [new tag] v2.8.0-rc6 -> v2.8.0-rc6 2025-12-04T09:43:33.0903808Z * [new tag] v2.8.0-rc7 -> v2.8.0-rc7 2025-12-04T09:43:33.0905130Z * [new tag] v2.8.0-rc8 -> v2.8.0-rc8 2025-12-04T09:43:33.0906546Z * [new tag] v2.9.0 -> v2.9.0 2025-12-04T09:43:33.0908011Z * [new tag] v2.9.0-rc1 -> v2.9.0-rc1 2025-12-04T09:43:33.0909613Z * [new tag] v2.9.0-rc10 -> v2.9.0-rc10 2025-12-04T09:43:33.0910822Z * [new tag] v2.9.0-rc11 -> v2.9.0-rc11 2025-12-04T09:43:33.0912380Z * [new tag] v2.9.0-rc2 -> v2.9.0-rc2 2025-12-04T09:43:33.0913899Z * [new tag] v2.9.0-rc3 -> v2.9.0-rc3 2025-12-04T09:43:33.0915375Z * [new tag] v2.9.0-rc4 -> v2.9.0-rc4 2025-12-04T09:43:33.0916780Z * [new tag] v2.9.0-rc5 -> v2.9.0-rc5 2025-12-04T09:43:33.0918382Z * [new tag] v2.9.0-rc6 -> v2.9.0-rc6 2025-12-04T09:43:33.0919753Z * [new tag] v2.9.0-rc7 -> v2.9.0-rc7 2025-12-04T09:43:33.0921325Z * [new tag] v2.9.0-rc8 -> v2.9.0-rc8 
2025-12-04T09:43:33.0922396Z * [new tag] v2.9.0-rc9 -> v2.9.0-rc9 2025-12-04T09:43:33.0923686Z * [new tag] v2.9.1 -> v2.9.1 2025-12-04T09:43:33.0925090Z * [new tag] v2.9.1-rc1 -> v2.9.1-rc1 2025-12-04T09:43:33.0926569Z * [new tag] v2.9.1-rc2 -> v2.9.1-rc2 2025-12-04T09:43:33.0928371Z * [new tag] viable/strict/1759343184 -> viable/strict/1759343184 2025-12-04T09:43:33.0929808Z * [new tag] viable/strict/1759346540 -> viable/strict/1759346540 2025-12-04T09:43:33.0931093Z * [new tag] viable/strict/1759348181 -> viable/strict/1759348181 2025-12-04T09:43:33.0932420Z * [new tag] viable/strict/1759350324 -> viable/strict/1759350324 2025-12-04T09:43:33.0933693Z * [new tag] viable/strict/1759351793 -> viable/strict/1759351793 2025-12-04T09:43:33.0934965Z * [new tag] viable/strict/1759353844 -> viable/strict/1759353844 2025-12-04T09:43:33.0936276Z * [new tag] viable/strict/1759355374 -> viable/strict/1759355374 2025-12-04T09:43:33.0937621Z * [new tag] viable/strict/1759357472 -> viable/strict/1759357472 2025-12-04T09:43:33.0938802Z * [new tag] viable/strict/1759361002 -> viable/strict/1759361002 2025-12-04T09:43:33.0940460Z * [new tag] viable/strict/1759362585 -> viable/strict/1759362585 2025-12-04T09:43:33.0942012Z * [new tag] viable/strict/1759365359 -> viable/strict/1759365359 2025-12-04T09:43:33.0943501Z * [new tag] viable/strict/1759370089 -> viable/strict/1759370089 2025-12-04T09:43:33.0944849Z * [new tag] viable/strict/1759377554 -> viable/strict/1759377554 2025-12-04T09:43:33.0946199Z * [new tag] viable/strict/1759379133 -> viable/strict/1759379133 2025-12-04T09:43:33.0947700Z * [new tag] viable/strict/1759389871 -> viable/strict/1759389871 2025-12-04T09:43:33.0949151Z * [new tag] viable/strict/1759393562 -> viable/strict/1759393562 2025-12-04T09:43:33.0950469Z * [new tag] viable/strict/1759395076 -> viable/strict/1759395076 2025-12-04T09:43:33.0951896Z * [new tag] viable/strict/1759398579 -> viable/strict/1759398579 2025-12-04T09:43:33.0953368Z * [new tag] 
viable/strict/1759404142 -> viable/strict/1759404142 2025-12-04T09:43:33.0954744Z * [new tag] viable/strict/1759405773 -> viable/strict/1759405773 2025-12-04T09:43:33.0956406Z * [new tag] viable/strict/1759408041 -> viable/strict/1759408041 2025-12-04T09:43:33.0958055Z * [new tag] viable/strict/1759411593 -> viable/strict/1759411593 2025-12-04T09:43:33.0959464Z * [new tag] viable/strict/1759427395 -> viable/strict/1759427395 2025-12-04T09:43:33.0960879Z * [new tag] viable/strict/1759434582 -> viable/strict/1759434582 2025-12-04T09:43:33.0962238Z * [new tag] viable/strict/1759436720 -> viable/strict/1759436720 2025-12-04T09:43:33.0963719Z * [new tag] viable/strict/1759440219 -> viable/strict/1759440219 2025-12-04T09:43:33.0965023Z * [new tag] viable/strict/1759441948 -> viable/strict/1759441948 2025-12-04T09:43:33.0966363Z * [new tag] viable/strict/1759443860 -> viable/strict/1759443860 2025-12-04T09:43:33.0967685Z * [new tag] viable/strict/1759445377 -> viable/strict/1759445377 2025-12-04T09:43:33.0969121Z * [new tag] viable/strict/1759447415 -> viable/strict/1759447415 2025-12-04T09:43:33.0970432Z * [new tag] viable/strict/1759451750 -> viable/strict/1759451750 2025-12-04T09:43:33.0971852Z * [new tag] viable/strict/1759453910 -> viable/strict/1759453910 2025-12-04T09:43:33.0973247Z * [new tag] viable/strict/1759456483 -> viable/strict/1759456483 2025-12-04T09:43:33.0974635Z * [new tag] viable/strict/1759459279 -> viable/strict/1759459279 2025-12-04T09:43:33.0976056Z * [new tag] viable/strict/1759460742 -> viable/strict/1759460742 2025-12-04T09:43:33.0977392Z * [new tag] viable/strict/1759462025 -> viable/strict/1759462025 2025-12-04T09:43:33.0978839Z * [new tag] viable/strict/1759469086 -> viable/strict/1759469086 2025-12-04T09:43:33.0980192Z * [new tag] viable/strict/1759470581 -> viable/strict/1759470581 2025-12-04T09:43:33.0981648Z * [new tag] viable/strict/1759472786 -> viable/strict/1759472786 2025-12-04T09:43:33.0983004Z * [new tag] viable/strict/1759476294 
-> viable/strict/1759476294 2025-12-04T09:43:33.0984336Z * [new tag] viable/strict/1759479963 -> viable/strict/1759479963 2025-12-04T09:43:33.0985711Z * [new tag] viable/strict/1759492177 -> viable/strict/1759492177 2025-12-04T09:43:33.0987096Z * [new tag] viable/strict/1759519278 -> viable/strict/1759519278 2025-12-04T09:43:33.0988592Z * [new tag] viable/strict/1759524580 -> viable/strict/1759524580 2025-12-04T09:43:33.0989931Z * [new tag] viable/strict/1759528193 -> viable/strict/1759528193 2025-12-04T09:43:33.0991421Z * [new tag] viable/strict/1759533797 -> viable/strict/1759533797 2025-12-04T09:43:33.0992938Z * [new tag] viable/strict/1759542780 -> viable/strict/1759542780 2025-12-04T09:43:33.0994232Z * [new tag] viable/strict/1759549779 -> viable/strict/1759549779 2025-12-04T09:43:33.0995601Z * [new tag] viable/strict/1759555455 -> viable/strict/1759555455 2025-12-04T09:43:33.0996980Z * [new tag] viable/strict/1759559176 -> viable/strict/1759559176 2025-12-04T09:43:33.0998433Z * [new tag] viable/strict/1759560629 -> viable/strict/1759560629 2025-12-04T09:43:33.0999756Z * [new tag] viable/strict/1759569848 -> viable/strict/1759569848 2025-12-04T09:43:33.1001268Z * [new tag] viable/strict/1759571382 -> viable/strict/1759571382 2025-12-04T09:43:33.1002662Z * [new tag] viable/strict/1759573474 -> viable/strict/1759573474 2025-12-04T09:43:33.1004071Z * [new tag] viable/strict/1759618187 -> viable/strict/1759618187 2025-12-04T09:43:33.1005374Z * [new tag] viable/strict/1759626742 -> viable/strict/1759626742 2025-12-04T09:43:33.1006771Z * [new tag] viable/strict/1759632427 -> viable/strict/1759632427 2025-12-04T09:43:33.1008113Z * [new tag] viable/strict/1759634971 -> viable/strict/1759634971 2025-12-04T09:43:33.1009569Z * [new tag] viable/strict/1759661382 -> viable/strict/1759661382 2025-12-04T09:43:33.1010958Z * [new tag] viable/strict/1759663294 -> viable/strict/1759663294 2025-12-04T09:43:33.1012118Z * [new tag] viable/strict/1759708178 -> 
viable/strict/1759708178 2025-12-04T09:43:33.1014056Z * [new tag] viable/strict/1759715695 -> viable/strict/1759715695 2025-12-04T09:43:33.1015472Z * [new tag] viable/strict/1759728293 -> viable/strict/1759728293 2025-12-04T09:43:33.1016797Z * [new tag] viable/strict/1759735513 -> viable/strict/1759735513 2025-12-04T09:43:33.1018184Z * [new tag] viable/strict/1759739177 -> viable/strict/1759739177 2025-12-04T09:43:33.1019566Z * [new tag] viable/strict/1759758635 -> viable/strict/1759758635 2025-12-04T09:43:33.1021029Z * [new tag] viable/strict/1759765784 -> viable/strict/1759765784 2025-12-04T09:43:33.1022311Z * [new tag] viable/strict/1759767948 -> viable/strict/1759767948 2025-12-04T09:43:33.1023706Z * [new tag] viable/strict/1759771461 -> viable/strict/1759771461 2025-12-04T09:43:33.1025038Z * [new tag] viable/strict/1759776706 -> viable/strict/1759776706 2025-12-04T09:43:33.1026460Z * [new tag] viable/strict/1759782317 -> viable/strict/1759782317 2025-12-04T09:43:33.1027930Z * [new tag] viable/strict/1759783777 -> viable/strict/1759783777 2025-12-04T09:43:33.1029341Z * [new tag] viable/strict/1759785815 -> viable/strict/1759785815 2025-12-04T09:43:33.1030842Z * [new tag] viable/strict/1759789459 -> viable/strict/1759789459 2025-12-04T09:43:33.1032186Z * [new tag] viable/strict/1759790974 -> viable/strict/1759790974 2025-12-04T09:43:33.1033370Z * [new tag] viable/strict/1759794583 -> viable/strict/1759794583 2025-12-04T09:43:33.1034636Z * [new tag] viable/strict/1759797408 -> viable/strict/1759797408 2025-12-04T09:43:33.1036021Z * [new tag] viable/strict/1759799518 -> viable/strict/1759799518 2025-12-04T09:43:33.1037378Z * [new tag] viable/strict/1759804909 -> viable/strict/1759804909 2025-12-04T09:43:33.1038733Z * [new tag] viable/strict/1759807643 -> viable/strict/1759807643 2025-12-04T09:43:33.1040192Z * [new tag] viable/strict/1759809089 -> viable/strict/1759809089 2025-12-04T09:43:33.1041558Z * [new tag] viable/strict/1759811145 -> viable/strict/1759811145 
2025-12-04T09:43:33.1042938Z * [new tag] viable/strict/1759812581 -> viable/strict/1759812581 2025-12-04T09:43:33.1044288Z * [new tag] viable/strict/1759814683 -> viable/strict/1759814683 2025-12-04T09:43:33.1045744Z * [new tag] viable/strict/1759821889 -> viable/strict/1759821889 2025-12-04T09:43:33.1047157Z * [new tag] viable/strict/1759823376 -> viable/strict/1759823376 2025-12-04T09:43:33.1048587Z * [new tag] viable/strict/1759827107 -> viable/strict/1759827107 2025-12-04T09:43:33.1050085Z * [new tag] viable/strict/1759830577 -> viable/strict/1759830577 2025-12-04T09:43:33.1051495Z * [new tag] viable/strict/1759832720 -> viable/strict/1759832720 2025-12-04T09:43:33.1052857Z * [new tag] viable/strict/1759842063 -> viable/strict/1759842063 2025-12-04T09:43:33.1054207Z * [new tag] viable/strict/1759847121 -> viable/strict/1759847121 2025-12-04T09:43:33.1055858Z * [new tag] viable/strict/1759850721 -> viable/strict/1759850721 2025-12-04T09:43:33.1057496Z * [new tag] viable/strict/1759857870 -> viable/strict/1759857870 2025-12-04T09:43:33.1058878Z * [new tag] viable/strict/1759863143 -> viable/strict/1759863143 2025-12-04T09:43:33.1060273Z * [new tag] viable/strict/1759875874 -> viable/strict/1759875874 2025-12-04T09:43:33.1061536Z * [new tag] viable/strict/1759877385 -> viable/strict/1759877385 2025-12-04T09:43:33.1062891Z * [new tag] viable/strict/1759883801 -> viable/strict/1759883801 2025-12-04T09:43:33.1064464Z * [new tag] viable/strict/1759885922 -> viable/strict/1759885922 2025-12-04T09:43:33.1065711Z * [new tag] viable/strict/1759888488 -> viable/strict/1759888488 2025-12-04T09:43:33.1067086Z * [new tag] viable/strict/1759895471 -> viable/strict/1759895471 2025-12-04T09:43:33.1068613Z * [new tag] viable/strict/1759904803 -> viable/strict/1759904803 2025-12-04T09:43:33.1070175Z * [new tag] viable/strict/1759908300 -> viable/strict/1759908300 2025-12-04T09:43:33.1071552Z * [new tag] viable/strict/1759915520 -> viable/strict/1759915520 
2025-12-04T09:43:33.1072932Z * [new tag] viable/strict/1759916978 -> viable/strict/1759916978 2025-12-04T09:43:33.1074189Z * [new tag] viable/strict/1759930024 -> viable/strict/1759930024 2025-12-04T09:43:33.1075547Z * [new tag] viable/strict/1759948122 -> viable/strict/1759948122 2025-12-04T09:43:33.1076979Z * [new tag] viable/strict/1759952983 -> viable/strict/1759952983 2025-12-04T09:43:33.1078402Z * [new tag] viable/strict/1759955121 -> viable/strict/1759955121 2025-12-04T09:43:33.1079783Z * [new tag] viable/strict/1759962298 -> viable/strict/1759962298 2025-12-04T09:43:33.1081154Z * [new tag] viable/strict/1759965837 -> viable/strict/1759965837 2025-12-04T09:43:33.1082608Z * [new tag] viable/strict/1759970213 -> viable/strict/1759970213 2025-12-04T09:43:33.1084009Z * [new tag] viable/strict/1759974894 -> viable/strict/1759974894 2025-12-04T09:43:33.1085346Z * [new tag] viable/strict/1759977763 -> viable/strict/1759977763 2025-12-04T09:43:33.1086808Z * [new tag] viable/strict/1759979241 -> viable/strict/1759979241 2025-12-04T09:43:33.1088163Z * [new tag] viable/strict/1759985417 -> viable/strict/1759985417 2025-12-04T09:43:33.1089535Z * [new tag] viable/strict/1759987490 -> viable/strict/1759987490 2025-12-04T09:43:33.1091003Z * [new tag] viable/strict/1759996180 -> viable/strict/1759996180 2025-12-04T09:43:33.1092408Z * [new tag] viable/strict/1760065682 -> viable/strict/1760065682 2025-12-04T09:43:33.1093840Z * [new tag] viable/strict/1760066894 -> viable/strict/1760066894 2025-12-04T09:43:33.1095222Z * [new tag] viable/strict/1760070345 -> viable/strict/1760070345 2025-12-04T09:43:33.1096650Z * [new tag] viable/strict/1760089782 -> viable/strict/1760089782 2025-12-04T09:43:33.1098032Z * [new tag] viable/strict/1760091921 -> viable/strict/1760091921 2025-12-04T09:43:33.1099497Z * [new tag] viable/strict/1760127924 -> viable/strict/1760127924 2025-12-04T09:43:33.1100888Z * [new tag] viable/strict/1760129489 -> viable/strict/1760129489 
2025-12-04T09:43:33.1102342Z * [new tag] viable/strict/1760132980 -> viable/strict/1760132980 2025-12-04T09:43:33.1104207Z * [new tag] viable/strict/1760135060 -> viable/strict/1760135060 2025-12-04T09:43:33.1105557Z * [new tag] viable/strict/1760215782 -> viable/strict/1760215782 2025-12-04T09:43:33.1107017Z * [new tag] viable/strict/1760273849 -> viable/strict/1760273849 2025-12-04T09:43:33.1108499Z * [new tag] viable/strict/1760275517 -> viable/strict/1760275517 2025-12-04T09:43:33.1109909Z * [new tag] viable/strict/1760276979 -> viable/strict/1760276979 2025-12-04T09:43:33.1111274Z * [new tag] viable/strict/1760279007 -> viable/strict/1760279007 2025-12-04T09:43:33.1112555Z * [new tag] viable/strict/1760286328 -> viable/strict/1760286328 2025-12-04T09:43:33.1113846Z * [new tag] viable/strict/1760493304 -> viable/strict/1760493304 2025-12-04T09:43:33.1115311Z * [new tag] viable/strict/1760496298 -> viable/strict/1760496298 2025-12-04T09:43:33.1116619Z * [new tag] viable/strict/1760518396 -> viable/strict/1760518396 2025-12-04T09:43:33.1117950Z * [new tag] viable/strict/1760534864 -> viable/strict/1760534864 2025-12-04T09:43:33.1119342Z * [new tag] viable/strict/1760549062 -> viable/strict/1760549062 2025-12-04T09:43:33.1120883Z * [new tag] viable/strict/1760552799 -> viable/strict/1760552799 2025-12-04T09:43:33.1122282Z * [new tag] viable/strict/1760554355 -> viable/strict/1760554355 2025-12-04T09:43:33.1123698Z * [new tag] viable/strict/1760556275 -> viable/strict/1760556275 2025-12-04T09:43:33.1125087Z * [new tag] viable/strict/1760564979 -> viable/strict/1760564979 2025-12-04T09:43:33.1126548Z * [new tag] viable/strict/1760567049 -> viable/strict/1760567049 2025-12-04T09:43:33.1128225Z * [new tag] viable/strict/1760568585 -> viable/strict/1760568585 2025-12-04T09:43:33.1129592Z * [new tag] viable/strict/1760570630 -> viable/strict/1760570630 2025-12-04T09:43:33.1130981Z * [new tag] viable/strict/1760572180 -> viable/strict/1760572180 
2025-12-04T09:43:33.1132404Z * [new tag] viable/strict/1760575094 -> viable/strict/1760575094 2025-12-04T09:43:33.1133862Z * [new tag] viable/strict/1760579709 -> viable/strict/1760579709 2025-12-04T09:43:33.1135627Z * [new tag] viable/strict/1760582614 -> viable/strict/1760582614 2025-12-04T09:43:33.1137121Z * [new tag] viable/strict/1760586815 -> viable/strict/1760586815 2025-12-04T09:43:33.1138406Z * [new tag] viable/strict/1760588829 -> viable/strict/1760588829 2025-12-04T09:43:33.1151108Z * [new tag] viable/strict/1760590200 -> viable/strict/1760590200 2025-12-04T09:43:33.1151547Z * [new tag] viable/strict/1760592311 -> viable/strict/1760592311 2025-12-04T09:43:33.1151862Z * [new tag] viable/strict/1760619733 -> viable/strict/1760619733 2025-12-04T09:43:33.1152177Z * [new tag] viable/strict/1760628335 -> viable/strict/1760628335 2025-12-04T09:43:33.1152352Z * [new tag] viable/strict/1760635490 -> viable/strict/1760635490 2025-12-04T09:43:33.1152517Z * [new tag] viable/strict/1760640743 -> viable/strict/1760640743 2025-12-04T09:43:33.1152676Z * [new tag] viable/strict/1760642528 -> viable/strict/1760642528 2025-12-04T09:43:33.1152829Z * [new tag] viable/strict/1760646330 -> viable/strict/1760646330 2025-12-04T09:43:33.1153090Z * [new tag] viable/strict/1760666101 -> viable/strict/1760666101 2025-12-04T09:43:33.1153401Z * [new tag] viable/strict/1760668990 -> viable/strict/1760668990 2025-12-04T09:43:33.1153662Z * [new tag] viable/strict/1760670600 -> viable/strict/1760670600 2025-12-04T09:43:33.1155422Z * [new tag] viable/strict/1760671704 -> viable/strict/1760671704 2025-12-04T09:43:33.1156906Z * [new tag] viable/strict/1760673121 -> viable/strict/1760673121 2025-12-04T09:43:33.1158189Z * [new tag] viable/strict/1760675352 -> viable/strict/1760675352 2025-12-04T09:43:33.1159824Z * [new tag] viable/strict/1760696731 -> viable/strict/1760696731 2025-12-04T09:43:33.1162193Z * [new tag] viable/strict/1760723515 -> viable/strict/1760723515 
2025-12-04T09:43:33.1163545Z * [new tag] viable/strict/1760727234 -> viable/strict/1760727234 2025-12-04T09:43:33.1164957Z * [new tag] viable/strict/1760730578 -> viable/strict/1760730578 2025-12-04T09:43:33.1166329Z * [new tag] viable/strict/1760732726 -> viable/strict/1760732726 2025-12-04T09:43:33.1167849Z * [new tag] viable/strict/1760734180 -> viable/strict/1760734180 2025-12-04T09:43:33.1169171Z * [new tag] viable/strict/1760736251 -> viable/strict/1760736251 2025-12-04T09:43:33.1170520Z * [new tag] viable/strict/1760737772 -> viable/strict/1760737772 2025-12-04T09:43:33.1171950Z * [new tag] viable/strict/1760758005 -> viable/strict/1760758005 2025-12-04T09:43:33.1173137Z * [new tag] viable/strict/1760761532 -> viable/strict/1760761532 2025-12-04T09:43:33.1174608Z * [new tag] viable/strict/1760802581 -> viable/strict/1760802581 2025-12-04T09:43:33.1176041Z * [new tag] viable/strict/1760827772 -> viable/strict/1760827772 2025-12-04T09:43:33.1177456Z * [new tag] viable/strict/1760834524 -> viable/strict/1760834524 2025-12-04T09:43:33.1178848Z * [new tag] viable/strict/1760845009 -> viable/strict/1760845009 2025-12-04T09:43:33.1180240Z * [new tag] viable/strict/1760876836 -> viable/strict/1760876836 2025-12-04T09:43:33.1181646Z * [new tag] viable/strict/1760880329 -> viable/strict/1760880329 2025-12-04T09:43:33.1183109Z * [new tag] viable/strict/1760888987 -> viable/strict/1760888987 2025-12-04T09:43:33.1184427Z * [new tag] viable/strict/1760912664 -> viable/strict/1760912664 2025-12-04T09:43:33.1185828Z * [new tag] viable/strict/1760925321 -> viable/strict/1760925321 2025-12-04T09:43:33.1187205Z * [new tag] viable/strict/1760931488 -> viable/strict/1760931488 2025-12-04T09:43:33.1188758Z * [new tag] viable/strict/1760932693 -> viable/strict/1760932693 2025-12-04T09:43:33.1190114Z * [new tag] viable/strict/1761004184 -> viable/strict/1761004184 2025-12-04T09:43:33.1191493Z * [new tag] viable/strict/1761014748 -> viable/strict/1761014748 
2025-12-04T09:43:33.1192927Z * [new tag] viable/strict/1761017491 -> viable/strict/1761017491 2025-12-04T09:43:33.1194381Z * [new tag] viable/strict/1761018806 -> viable/strict/1761018806 2025-12-04T09:43:33.1196410Z * [new tag] viable/strict/1761020754 -> viable/strict/1761020754 2025-12-04T09:43:33.1197858Z * [new tag] viable/strict/1761024303 -> viable/strict/1761024303 2025-12-04T09:43:33.1199267Z * [new tag] viable/strict/1761029582 -> viable/strict/1761029582 2025-12-04T09:43:33.1200675Z * [new tag] viable/strict/1761031535 -> viable/strict/1761031535 2025-12-04T09:43:33.1201998Z * [new tag] viable/strict/1761035196 -> viable/strict/1761035196 2025-12-04T09:43:33.1203486Z * [new tag] viable/strict/1761045825 -> viable/strict/1761045825 2025-12-04T09:43:33.1204934Z * [new tag] viable/strict/1761054796 -> viable/strict/1761054796 2025-12-04T09:43:33.1206315Z * [new tag] viable/strict/1761060314 -> viable/strict/1761060314 2025-12-04T09:43:33.1207729Z * [new tag] viable/strict/1761071198 -> viable/strict/1761071198 2025-12-04T09:43:33.1209207Z * [new tag] viable/strict/1761074628 -> viable/strict/1761074628 2025-12-04T09:43:33.1210604Z * [new tag] viable/strict/1761078351 -> viable/strict/1761078351 2025-12-04T09:43:33.1211970Z * [new tag] viable/strict/1761079822 -> viable/strict/1761079822 2025-12-04T09:43:33.1213320Z * [new tag] viable/strict/1761081873 -> viable/strict/1761081873 2025-12-04T09:43:33.1214725Z * [new tag] viable/strict/1761083392 -> viable/strict/1761083392 2025-12-04T09:43:33.1216171Z * [new tag] viable/strict/1761085465 -> viable/strict/1761085465 2025-12-04T09:43:33.1217586Z * [new tag] viable/strict/1761089099 -> viable/strict/1761089099 2025-12-04T09:43:33.1219039Z * [new tag] viable/strict/1761095535 -> viable/strict/1761095535 2025-12-04T09:43:33.1220346Z * [new tag] viable/strict/1761098119 -> viable/strict/1761098119 2025-12-04T09:43:33.1222037Z * [new tag] viable/strict/1761101330 -> viable/strict/1761101330 
2025-12-04T09:43:33.1223471Z * [new tag] viable/strict/1761114425 -> viable/strict/1761114425 2025-12-04T09:43:33.1224938Z * [new tag] viable/strict/1761116036 -> viable/strict/1761116036 2025-12-04T09:43:33.1226348Z * [new tag] viable/strict/1761119379 -> viable/strict/1761119379 2025-12-04T09:43:33.1227820Z * [new tag] viable/strict/1761121601 -> viable/strict/1761121601 2025-12-04T09:43:33.1229273Z * [new tag] viable/strict/1761123234 -> viable/strict/1761123234 2025-12-04T09:43:33.1230674Z * [new tag] viable/strict/1761126621 -> viable/strict/1761126621 2025-12-04T09:43:33.1232014Z * [new tag] viable/strict/1761132259 -> viable/strict/1761132259 2025-12-04T09:43:33.1233464Z * [new tag] viable/strict/1761146746 -> viable/strict/1761146746 2025-12-04T09:43:33.1234835Z * [new tag] viable/strict/1761164752 -> viable/strict/1761164752 2025-12-04T09:43:33.1236186Z * [new tag] viable/strict/1761166198 -> viable/strict/1761166198 2025-12-04T09:43:33.1237570Z * [new tag] viable/strict/1761175424 -> viable/strict/1761175424 2025-12-04T09:43:33.1238968Z * [new tag] viable/strict/1761176983 -> viable/strict/1761176983 2025-12-04T09:43:33.1240477Z * [new tag] viable/strict/1761179891 -> viable/strict/1761179891 2025-12-04T09:43:33.1241947Z * [new tag] viable/strict/1761181930 -> viable/strict/1761181930 2025-12-04T09:43:33.1243331Z * [new tag] viable/strict/1761184516 -> viable/strict/1761184516 2025-12-04T09:43:33.1244769Z * [new tag] viable/strict/1761190179 -> viable/strict/1761190179 2025-12-04T09:43:33.1246173Z * [new tag] viable/strict/1761193558 -> viable/strict/1761193558 2025-12-04T09:43:33.1247567Z * [new tag] viable/strict/1761207990 -> viable/strict/1761207990 2025-12-04T09:43:33.1248918Z * [new tag] viable/strict/1761229539 -> viable/strict/1761229539 2025-12-04T09:43:33.1250508Z * [new tag] viable/strict/1761244031 -> viable/strict/1761244031 2025-12-04T09:43:33.1251885Z * [new tag] viable/strict/1761248986 -> viable/strict/1761248986 
2025-12-04T09:43:33.1253307Z * [new tag] viable/strict/1761259791 -> viable/strict/1761259791 2025-12-04T09:43:33.1254651Z * [new tag] viable/strict/1761266139 -> viable/strict/1761266139 2025-12-04T09:43:33.1256297Z * [new tag] viable/strict/1761268316 -> viable/strict/1761268316 2025-12-04T09:43:33.1257740Z * [new tag] viable/strict/1761273805 -> viable/strict/1761273805 2025-12-04T09:43:33.1259173Z * [new tag] viable/strict/1761275261 -> viable/strict/1761275261 2025-12-04T09:43:33.1260611Z * [new tag] viable/strict/1761277913 -> viable/strict/1761277913 2025-12-04T09:43:33.1262062Z * [new tag] viable/strict/1761290701 -> viable/strict/1761290701 2025-12-04T09:43:33.1263478Z * [new tag] viable/strict/1761294396 -> viable/strict/1761294396 2025-12-04T09:43:33.1264837Z * [new tag] viable/strict/1761303047 -> viable/strict/1761303047 2025-12-04T09:43:33.1266227Z * [new tag] viable/strict/1761335388 -> viable/strict/1761335388 2025-12-04T09:43:33.1267732Z * [new tag] viable/strict/1761337551 -> viable/strict/1761337551 2025-12-04T09:43:33.1269227Z * [new tag] viable/strict/1761339007 -> viable/strict/1761339007 2025-12-04T09:43:33.1270509Z * [new tag] viable/strict/1761341050 -> viable/strict/1761341050 2025-12-04T09:43:33.1271918Z * [new tag] viable/strict/1761346188 -> viable/strict/1761346188 2025-12-04T09:43:33.1273425Z * [new tag] viable/strict/1761349792 -> viable/strict/1761349792 2025-12-04T09:43:33.1274847Z * [new tag] viable/strict/1761352620 -> viable/strict/1761352620 2025-12-04T09:43:33.1276337Z * [new tag] viable/strict/1761354730 -> viable/strict/1761354730 2025-12-04T09:43:33.1277666Z * [new tag] viable/strict/1761357298 -> viable/strict/1761357298 2025-12-04T09:43:33.1279126Z * [new tag] viable/strict/1761360201 -> viable/strict/1761360201 2025-12-04T09:43:33.1280489Z * [new tag] viable/strict/1761361753 -> viable/strict/1761361753 2025-12-04T09:43:33.1281881Z * [new tag] viable/strict/1761364351 -> viable/strict/1761364351 
2025-12-04T09:43:33.1283224Z * [new tag] viable/strict/1761366338 -> viable/strict/1761366338 2025-12-04T09:43:33.1284724Z * [new tag] viable/strict/1761367802 -> viable/strict/1761367802 2025-12-04T09:43:33.1286111Z * [new tag] viable/strict/1761369889 -> viable/strict/1761369889 2025-12-04T09:43:33.1287913Z * [new tag] viable/strict/1761371385 -> viable/strict/1761371385 2025-12-04T09:43:33.1289371Z * [new tag] viable/strict/1761373581 -> viable/strict/1761373581 2025-12-04T09:43:33.1290919Z * [new tag] viable/strict/1761375054 -> viable/strict/1761375054 2025-12-04T09:43:33.1292289Z * [new tag] viable/strict/1761421785 -> viable/strict/1761421785 2025-12-04T09:43:33.1293820Z * [new tag] viable/strict/1761434614 -> viable/strict/1761434614 2025-12-04T09:43:33.1295513Z * [new tag] viable/strict/1761439254 -> viable/strict/1761439254 2025-12-04T09:43:33.1297006Z * [new tag] viable/strict/1761454187 -> viable/strict/1761454187 2025-12-04T09:43:33.1298473Z * [new tag] viable/strict/1761459991 -> viable/strict/1761459991 2025-12-04T09:43:33.1299995Z * [new tag] viable/strict/1761470668 -> viable/strict/1761470668 2025-12-04T09:43:33.1301729Z * [new tag] viable/strict/1761472188 -> viable/strict/1761472188 2025-12-04T09:43:33.1303209Z * [new tag] viable/strict/1761503178 -> viable/strict/1761503178 2025-12-04T09:43:33.1304606Z * [new tag] viable/strict/1761517492 -> viable/strict/1761517492 2025-12-04T09:43:33.1306027Z * [new tag] viable/strict/1761518981 -> viable/strict/1761518981 2025-12-04T09:43:33.1307571Z * [new tag] viable/strict/1761533609 -> viable/strict/1761533609 2025-12-04T09:43:33.1308975Z * [new tag] viable/strict/1761546438 -> viable/strict/1761546438 2025-12-04T09:43:33.1310426Z * [new tag] viable/strict/1761548133 -> viable/strict/1761548133 2025-12-04T09:43:33.1311993Z * [new tag] viable/strict/1761555186 -> viable/strict/1761555186 2025-12-04T09:43:33.1313560Z * [new tag] viable/strict/1761557178 -> viable/strict/1761557178 
2025-12-04T09:43:33.1314978Z * [new tag] viable/strict/1761560772 -> viable/strict/1761560772 2025-12-04T09:43:33.1316411Z * [new tag] viable/strict/1761562266 -> viable/strict/1761562266 2025-12-04T09:43:33.1317817Z * [new tag] viable/strict/1761564260 -> viable/strict/1761564260 2025-12-04T09:43:33.1319307Z * [new tag] viable/strict/1761568072 -> viable/strict/1761568072 2025-12-04T09:43:33.1320689Z * [new tag] viable/strict/1761571683 -> viable/strict/1761571683 2025-12-04T09:43:33.1322176Z * [new tag] viable/strict/1761580199 -> viable/strict/1761580199 2025-12-04T09:43:33.1323563Z * [new tag] viable/strict/1761587383 -> viable/strict/1761587383 2025-12-04T09:43:33.1325043Z * [new tag] viable/strict/1761591165 -> viable/strict/1761591165 2025-12-04T09:43:33.1326394Z * [new tag] viable/strict/1761594575 -> viable/strict/1761594575 2025-12-04T09:43:33.1327805Z * [new tag] viable/strict/1761596710 -> viable/strict/1761596710 2025-12-04T09:43:33.1329221Z * [new tag] viable/strict/1761598189 -> viable/strict/1761598189 2025-12-04T09:43:33.1330710Z * [new tag] viable/strict/1761600254 -> viable/strict/1761600254 2025-12-04T09:43:33.1332075Z * [new tag] viable/strict/1761603879 -> viable/strict/1761603879 2025-12-04T09:43:33.1333512Z * [new tag] viable/strict/1761605429 -> viable/strict/1761605429 2025-12-04T09:43:33.1335016Z * [new tag] viable/strict/1761607468 -> viable/strict/1761607468 2025-12-04T09:43:33.1336514Z * [new tag] viable/strict/1761608983 -> viable/strict/1761608983 2025-12-04T09:43:33.1337930Z * [new tag] viable/strict/1761611846 -> viable/strict/1761611846 2025-12-04T09:43:33.1339386Z * [new tag] viable/strict/1761613922 -> viable/strict/1761613922 2025-12-04T09:43:33.1340701Z * [new tag] viable/strict/1761616504 -> viable/strict/1761616504 2025-12-04T09:43:33.1342190Z * [new tag] viable/strict/1761619599 -> viable/strict/1761619599 2025-12-04T09:43:33.1343566Z * [new tag] viable/strict/1761686693 -> viable/strict/1761686693 
2025-12-04T09:43:33.1344957Z * [new tag] viable/strict/1761688179 -> viable/strict/1761688179 2025-12-04T09:43:33.1346370Z * [new tag] viable/strict/1761691973 -> viable/strict/1761691973 2025-12-04T09:43:33.1348094Z * [new tag] viable/strict/1761693884 -> viable/strict/1761693884 2025-12-04T09:43:33.1349429Z * [new tag] viable/strict/1761695389 -> viable/strict/1761695389 2025-12-04T09:43:33.1350859Z * [new tag] viable/strict/1761698408 -> viable/strict/1761698408 2025-12-04T09:43:33.1352280Z * [new tag] viable/strict/1761702931 -> viable/strict/1761702931 2025-12-04T09:43:33.1353689Z * [new tag] viable/strict/1761706307 -> viable/strict/1761706307 2025-12-04T09:43:33.1355128Z * [new tag] viable/strict/1761709065 -> viable/strict/1761709065 2025-12-04T09:43:33.1356943Z * [new tag] viable/strict/1761710285 -> viable/strict/1761710285 2025-12-04T09:43:33.1358429Z * [new tag] viable/strict/1761711983 -> viable/strict/1761711983 2025-12-04T09:43:33.1359887Z * [new tag] viable/strict/1761713514 -> viable/strict/1761713514 2025-12-04T09:43:33.1361412Z * [new tag] viable/strict/1761715523 -> viable/strict/1761715523 2025-12-04T09:43:33.1362903Z * [new tag] viable/strict/1761727973 -> viable/strict/1761727973 2025-12-04T09:43:33.1364400Z * [new tag] viable/strict/1761751558 -> viable/strict/1761751558 2025-12-04T09:43:33.1365883Z * [new tag] viable/strict/1761755187 -> viable/strict/1761755187 2025-12-04T09:43:33.1367359Z * [new tag] viable/strict/1761756826 -> viable/strict/1761756826 2025-12-04T09:43:33.1368996Z * [new tag] viable/strict/1761769551 -> viable/strict/1761769551 2025-12-04T09:43:33.1370546Z * [new tag] viable/strict/1761771032 -> viable/strict/1761771032 2025-12-04T09:43:33.1371886Z * [new tag] viable/strict/1761773101 -> viable/strict/1761773101 2025-12-04T09:43:33.1373338Z * [new tag] viable/strict/1761781792 -> viable/strict/1761781792 2025-12-04T09:43:33.1374951Z * [new tag] viable/strict/1761784788 -> viable/strict/1761784788 
2025-12-04T09:43:33.1376316Z * [new tag] viable/strict/1761786740 -> viable/strict/1761786740 2025-12-04T09:43:33.1377717Z * [new tag] viable/strict/1761789332 -> viable/strict/1761789332 2025-12-04T09:43:33.1379522Z * [new tag] viable/strict/1761792569 -> viable/strict/1761792569 2025-12-04T09:43:33.1381004Z * [new tag] viable/strict/1761795289 -> viable/strict/1761795289 2025-12-04T09:43:33.1382895Z * [new tag] viable/strict/1761798345 -> viable/strict/1761798345 2025-12-04T09:43:33.1384364Z * [new tag] viable/strict/1761799827 -> viable/strict/1761799827 2025-12-04T09:43:33.1385828Z * [new tag] viable/strict/1761805604 -> viable/strict/1761805604 2025-12-04T09:43:33.1387344Z * [new tag] viable/strict/1761807202 -> viable/strict/1761807202 2025-12-04T09:43:33.1388885Z * [new tag] viable/strict/1761809094 -> viable/strict/1761809094 2025-12-04T09:43:33.1390317Z * [new tag] viable/strict/1761810576 -> viable/strict/1761810576 2025-12-04T09:43:33.1391765Z * [new tag] viable/strict/1761812771 -> viable/strict/1761812771 2025-12-04T09:43:33.1393263Z * [new tag] viable/strict/1761814363 -> viable/strict/1761814363 2025-12-04T09:43:33.1394711Z * [new tag] viable/strict/1761857410 -> viable/strict/1761857410 2025-12-04T09:43:33.1396183Z * [new tag] viable/strict/1761860985 -> viable/strict/1761860985 2025-12-04T09:43:33.1397623Z * [new tag] viable/strict/1761863094 -> viable/strict/1761863094 2025-12-04T09:43:33.1399074Z * [new tag] viable/strict/1761864590 -> viable/strict/1761864590 2025-12-04T09:43:33.1400540Z * [new tag] viable/strict/1761866675 -> viable/strict/1761866675 2025-12-04T09:43:33.1402144Z * [new tag] viable/strict/1761868178 -> viable/strict/1761868178 2025-12-04T09:43:33.1403690Z * [new tag] viable/strict/1761871111 -> viable/strict/1761871111 2025-12-04T09:43:33.1405141Z * [new tag] viable/strict/1761873126 -> viable/strict/1761873126 2025-12-04T09:43:33.1406607Z * [new tag] viable/strict/1761875714 -> viable/strict/1761875714 
2025-12-04T09:43:33.1408320Z * [new tag] viable/strict/1761878924 -> viable/strict/1761878924 2025-12-04T09:43:33.1410701Z * [new tag] viable/strict/1761881727 -> viable/strict/1761881727 2025-12-04T09:43:33.1411457Z * [new tag] viable/strict/1761882959 -> viable/strict/1761882959 2025-12-04T09:43:33.1412410Z * [new tag] viable/strict/1761886268 -> viable/strict/1761886268 2025-12-04T09:43:33.1413952Z * [new tag] viable/strict/1761893641 -> viable/strict/1761893641 2025-12-04T09:43:33.1415396Z * [new tag] viable/strict/1761931517 -> viable/strict/1761931517 2025-12-04T09:43:33.1416884Z * [new tag] viable/strict/1761933080 -> viable/strict/1761933080 2025-12-04T09:43:33.1418366Z * [new tag] viable/strict/1761935217 -> viable/strict/1761935217 2025-12-04T09:43:33.1419885Z * [new tag] viable/strict/1761938533 -> viable/strict/1761938533 2025-12-04T09:43:33.1421314Z * [new tag] viable/strict/1761940184 -> viable/strict/1761940184 2025-12-04T09:43:33.1422775Z * [new tag] viable/strict/1761942338 -> viable/strict/1761942338 2025-12-04T09:43:33.1424204Z * [new tag] viable/strict/1761946100 -> viable/strict/1761946100 2025-12-04T09:43:33.1425736Z * [new tag] viable/strict/1761947374 -> viable/strict/1761947374 2025-12-04T09:43:33.1427190Z * [new tag] viable/strict/1761950978 -> viable/strict/1761950978 2025-12-04T09:43:33.1428845Z * [new tag] viable/strict/1761957727 -> viable/strict/1761957727 2025-12-04T09:43:33.1430228Z * [new tag] viable/strict/1761959532 -> viable/strict/1761959532 2025-12-04T09:43:33.1431752Z * [new tag] viable/strict/1761965366 -> viable/strict/1761965366 2025-12-04T09:43:33.1433282Z * [new tag] viable/strict/1761968066 -> viable/strict/1761968066 2025-12-04T09:43:33.1434781Z * [new tag] viable/strict/1761969322 -> viable/strict/1761969322 2025-12-04T09:43:33.1436294Z * [new tag] viable/strict/1761974723 -> viable/strict/1761974723 2025-12-04T09:43:33.1437833Z * [new tag] viable/strict/1761981837 -> viable/strict/1761981837 
2025-12-04T09:43:33.1439295Z * [new tag] viable/strict/1761985546 -> viable/strict/1761985546 2025-12-04T09:43:33.1440776Z * [new tag] viable/strict/1761987030 -> viable/strict/1761987030 2025-12-04T09:43:33.1442294Z * [new tag] viable/strict/1762003554 -> viable/strict/1762003554 2025-12-04T09:43:33.1443838Z * [new tag] viable/strict/1762021560 -> viable/strict/1762021560 2025-12-04T09:43:33.1445245Z * [new tag] viable/strict/1762032190 -> viable/strict/1762032190 2025-12-04T09:43:33.1446762Z * [new tag] viable/strict/1762040981 -> viable/strict/1762040981 2025-12-04T09:43:33.1448269Z * [new tag] viable/strict/1762048525 -> viable/strict/1762048525 2025-12-04T09:43:33.1449729Z * [new tag] viable/strict/1762104223 -> viable/strict/1762104223 2025-12-04T09:43:33.1451158Z * [new tag] viable/strict/1762105778 -> viable/strict/1762105778 2025-12-04T09:43:33.1452620Z * [new tag] viable/strict/1762115109 -> viable/strict/1762115109 2025-12-04T09:43:33.1454072Z * [new tag] viable/strict/1762125840 -> viable/strict/1762125840 2025-12-04T09:43:33.1455516Z * [new tag] viable/strict/1762127377 -> viable/strict/1762127377 2025-12-04T09:43:33.1458619Z * [new tag] viable/strict/1762134925 -> viable/strict/1762134925 2025-12-04T09:43:33.1460046Z * [new tag] viable/strict/1762138338 -> viable/strict/1762138338 2025-12-04T09:43:33.1461524Z * [new tag] viable/strict/1762148993 -> viable/strict/1762148993 2025-12-04T09:43:33.1463010Z * [new tag] viable/strict/1762152871 -> viable/strict/1762152871 2025-12-04T09:43:33.1464540Z * [new tag] viable/strict/1762156183 -> viable/strict/1762156183 2025-12-04T09:43:33.1466047Z * [new tag] viable/strict/1762163457 -> viable/strict/1762163457 2025-12-04T09:43:33.1467619Z * [new tag] viable/strict/1762165569 -> viable/strict/1762165569 2025-12-04T09:43:33.1469050Z * [new tag] viable/strict/1762169035 -> viable/strict/1762169035 2025-12-04T09:43:33.1470550Z * [new tag] viable/strict/1762174936 -> viable/strict/1762174936 
2025-12-04T09:43:33.1471975Z * [new tag] viable/strict/1762194412 -> viable/strict/1762194412 2025-12-04T09:43:33.1473437Z * [new tag] viable/strict/1762195876 -> viable/strict/1762195876 2025-12-04T09:43:33.1474874Z * [new tag] viable/strict/1762197788 -> viable/strict/1762197788 2025-12-04T09:43:33.1476369Z * [new tag] viable/strict/1762199389 -> viable/strict/1762199389 2025-12-04T09:43:33.1477943Z * [new tag] viable/strict/1762206585 -> viable/strict/1762206585 2025-12-04T09:43:33.1479929Z * [new tag] viable/strict/1762210184 -> viable/strict/1762210184 2025-12-04T09:43:33.1481310Z * [new tag] viable/strict/1762218736 -> viable/strict/1762218736 2025-12-04T09:43:33.1482756Z * [new tag] viable/strict/1762224529 -> viable/strict/1762224529 2025-12-04T09:43:33.1484440Z * [new tag] viable/strict/1762227253 -> viable/strict/1762227253 2025-12-04T09:43:33.1485688Z * [new tag] viable/strict/1762228515 -> viable/strict/1762228515 2025-12-04T09:43:33.1487090Z * [new tag] viable/strict/1762230349 -> viable/strict/1762230349 2025-12-04T09:43:33.1488522Z * [new tag] viable/strict/1762231859 -> viable/strict/1762231859 2025-12-04T09:43:33.1489974Z * [new tag] viable/strict/1762233925 -> viable/strict/1762233925 2025-12-04T09:43:33.1491590Z * [new tag] viable/strict/1762237630 -> viable/strict/1762237630 2025-12-04T09:43:33.1492887Z * [new tag] viable/strict/1762253522 -> viable/strict/1762253522 2025-12-04T09:43:33.1494509Z * [new tag] viable/strict/1762278588 -> viable/strict/1762278588 2025-12-04T09:43:33.1495966Z * [new tag] viable/strict/1762284203 -> viable/strict/1762284203 2025-12-04T09:43:33.1497446Z * [new tag] viable/strict/1762289446 -> viable/strict/1762289446 2025-12-04T09:43:33.1498887Z * [new tag] viable/strict/1762291515 -> viable/strict/1762291515 2025-12-04T09:43:33.1500348Z * [new tag] viable/strict/1762295100 -> viable/strict/1762295100 2025-12-04T09:43:33.1501676Z * [new tag] viable/strict/1762296590 -> viable/strict/1762296590 
2025-12-04T09:43:33.1503030Z * [new tag] viable/strict/1762300179 -> viable/strict/1762300179 2025-12-04T09:43:33.1504368Z * [new tag] viable/strict/1762303207 -> viable/strict/1762303207 2025-12-04T09:43:33.1505869Z * [new tag] viable/strict/1762386584 -> viable/strict/1762386584 2025-12-04T09:43:33.1507397Z * [new tag] viable/strict/1762391537 -> viable/strict/1762391537 2025-12-04T09:43:33.1508852Z * [new tag] viable/strict/1762394119 -> viable/strict/1762394119 2025-12-04T09:43:33.1510544Z * [new tag] viable/strict/1762397437 -> viable/strict/1762397437 2025-12-04T09:43:33.1511981Z * [new tag] viable/strict/1762400256 -> viable/strict/1762400256 2025-12-04T09:43:33.1513559Z * [new tag] viable/strict/1762401469 -> viable/strict/1762401469 2025-12-04T09:43:33.1515139Z * [new tag] viable/strict/1762408195 -> viable/strict/1762408195 2025-12-04T09:43:33.1516576Z * [new tag] viable/strict/1762410411 -> viable/strict/1762410411 2025-12-04T09:43:33.1518112Z * [new tag] viable/strict/1762417613 -> viable/strict/1762417613 2025-12-04T09:43:33.1519583Z * [new tag] viable/strict/1762419198 -> viable/strict/1762419198 2025-12-04T09:43:33.1521108Z * [new tag] viable/strict/1762422656 -> viable/strict/1762422656 2025-12-04T09:43:33.1522850Z * [new tag] viable/strict/1762424746 -> viable/strict/1762424746 2025-12-04T09:43:33.1524398Z * [new tag] viable/strict/1762446386 -> viable/strict/1762446386 2025-12-04T09:43:33.1525905Z * [new tag] viable/strict/1762449912 -> viable/strict/1762449912 2025-12-04T09:43:33.1527393Z * [new tag] viable/strict/1762457031 -> viable/strict/1762457031 2025-12-04T09:43:33.1528864Z * [new tag] viable/strict/1762462441 -> viable/strict/1762462441 2025-12-04T09:43:33.1530313Z * [new tag] viable/strict/1762467909 -> viable/strict/1762467909 2025-12-04T09:43:33.1531824Z * [new tag] viable/strict/1762471493 -> viable/strict/1762471493 2025-12-04T09:43:33.1533320Z * [new tag] viable/strict/1762475990 -> viable/strict/1762475990 
2025-12-04T09:43:33.1534827Z * [new tag] viable/strict/1762477933 -> viable/strict/1762477933 2025-12-04T09:43:33.1536316Z * [new tag] viable/strict/1762491053 -> viable/strict/1762491053 2025-12-04T09:43:33.1537864Z * [new tag] viable/strict/1762493118 -> viable/strict/1762493118 2025-12-04T09:43:33.1539318Z * [new tag] viable/strict/1762498442 -> viable/strict/1762498442 2025-12-04T09:43:33.1540708Z * [new tag] viable/strict/1762501778 -> viable/strict/1762501778 2025-12-04T09:43:33.1542196Z * [new tag] viable/strict/1762504001 -> viable/strict/1762504001 2025-12-04T09:43:33.1543718Z * [new tag] viable/strict/1762505583 -> viable/strict/1762505583 2025-12-04T09:43:33.1545285Z * [new tag] viable/strict/1762507523 -> viable/strict/1762507523 2025-12-04T09:43:33.1546755Z * [new tag] viable/strict/1762511140 -> viable/strict/1762511140 2025-12-04T09:43:33.1548526Z * [new tag] viable/strict/1762512632 -> viable/strict/1762512632 2025-12-04T09:43:33.1550009Z * [new tag] viable/strict/1762520467 -> viable/strict/1762520467 2025-12-04T09:43:33.1551426Z * [new tag] viable/strict/1762522016 -> viable/strict/1762522016 2025-12-04T09:43:33.1552883Z * [new tag] viable/strict/1762530591 -> viable/strict/1762530591 2025-12-04T09:43:33.1554356Z * [new tag] viable/strict/1762543405 -> viable/strict/1762543405 2025-12-04T09:43:33.1555905Z * [new tag] viable/strict/1762544998 -> viable/strict/1762544998 2025-12-04T09:43:33.1557478Z * [new tag] viable/strict/1762552182 -> viable/strict/1762552182 2025-12-04T09:43:33.1558953Z * [new tag] viable/strict/1762554297 -> viable/strict/1762554297 2025-12-04T09:43:33.1560346Z * [new tag] viable/strict/1762559381 -> viable/strict/1762559381 2025-12-04T09:43:33.1561915Z * [new tag] viable/strict/1762562222 -> viable/strict/1762562222 2025-12-04T09:43:33.1563388Z * [new tag] viable/strict/1762564319 -> viable/strict/1762564319 2025-12-04T09:43:33.1564736Z * [new tag] viable/strict/1762566904 -> viable/strict/1762566904 
2025-12-04T09:43:33.1566189Z * [new tag] viable/strict/1762569781 -> viable/strict/1762569781 2025-12-04T09:43:33.1567648Z * [new tag] viable/strict/1762575940 -> viable/strict/1762575940 2025-12-04T09:43:33.1569115Z * [new tag] viable/strict/1762580974 -> viable/strict/1762580974 2025-12-04T09:43:33.1570658Z * [new tag] viable/strict/1762583185 -> viable/strict/1762583185 2025-12-04T09:43:33.1572116Z * [new tag] viable/strict/1762586647 -> viable/strict/1762586647 2025-12-04T09:43:33.1573597Z * [new tag] viable/strict/1762588183 -> viable/strict/1762588183 2025-12-04T09:43:33.1575501Z * [new tag] viable/strict/1762593886 -> viable/strict/1762593886 2025-12-04T09:43:33.1576966Z * [new tag] viable/strict/1762650743 -> viable/strict/1762650743 2025-12-04T09:43:33.1578486Z * [new tag] viable/strict/1762653328 -> viable/strict/1762653328 2025-12-04T09:43:33.1579937Z * [new tag] viable/strict/1762659342 -> viable/strict/1762659342 2025-12-04T09:43:33.1581366Z * [new tag] viable/strict/1762662360 -> viable/strict/1762662360 2025-12-04T09:43:33.1582847Z * [new tag] viable/strict/1762667377 -> viable/strict/1762667377 2025-12-04T09:43:33.1584326Z * [new tag] viable/strict/1762671090 -> viable/strict/1762671090 2025-12-04T09:43:33.1585765Z * [new tag] viable/strict/1762680284 -> viable/strict/1762680284 2025-12-04T09:43:33.1587318Z * [new tag] viable/strict/1762683900 -> viable/strict/1762683900 2025-12-04T09:43:33.1588801Z * [new tag] viable/strict/1762705541 -> viable/strict/1762705541 2025-12-04T09:43:33.1590259Z * [new tag] viable/strict/1762709004 -> viable/strict/1762709004 2025-12-04T09:43:33.1591903Z * [new tag] viable/strict/1762746004 -> viable/strict/1762746004 2025-12-04T09:43:33.1593395Z * [new tag] viable/strict/1762748799 -> viable/strict/1762748799 2025-12-04T09:43:33.1594913Z * [new tag] viable/strict/1762759504 -> viable/strict/1762759504 2025-12-04T09:43:33.1596410Z * [new tag] viable/strict/1762760973 -> viable/strict/1762760973 
2025-12-04T09:43:33.1597932Z * [new tag] viable/strict/1762775374 -> viable/strict/1762775374 2025-12-04T09:43:33.1599440Z * [new tag] viable/strict/1762777661 -> viable/strict/1762777661 2025-12-04T09:43:33.1600860Z * [new tag] viable/strict/1762779774 -> viable/strict/1762779774 2025-12-04T09:43:33.1602442Z * [new tag] viable/strict/1762781259 -> viable/strict/1762781259 2025-12-04T09:43:33.1604005Z * [new tag] viable/strict/1762793628 -> viable/strict/1762793628 2025-12-04T09:43:33.1605453Z * [new tag] viable/strict/1762800711 -> viable/strict/1762800711 2025-12-04T09:43:33.1606925Z * [new tag] viable/strict/1762809894 -> viable/strict/1762809894 2025-12-04T09:43:33.1608403Z * [new tag] viable/strict/1762811384 -> viable/strict/1762811384 2025-12-04T09:43:33.1610092Z * [new tag] viable/strict/1762813841 -> viable/strict/1762813841 2025-12-04T09:43:33.1611496Z * [new tag] viable/strict/1762815047 -> viable/strict/1762815047 2025-12-04T09:43:33.1613125Z * [new tag] viable/strict/1762817094 -> viable/strict/1762817094 2025-12-04T09:43:33.1614573Z * [new tag] viable/strict/1762818582 -> viable/strict/1762818582 2025-12-04T09:43:33.1616103Z * [new tag] viable/strict/1762821623 -> viable/strict/1762821623 2025-12-04T09:43:33.1617407Z * [new tag] viable/strict/1762823531 -> viable/strict/1762823531 2025-12-04T09:43:33.1618958Z * [new tag] viable/strict/1762849583 -> viable/strict/1762849583 2025-12-04T09:43:33.1620393Z * [new tag] viable/strict/1762851200 -> viable/strict/1762851200 2025-12-04T09:43:33.1621890Z * [new tag] viable/strict/1762854603 -> viable/strict/1762854603 2025-12-04T09:43:33.1623404Z * [new tag] viable/strict/1762858276 -> viable/strict/1762858276 2025-12-04T09:43:33.1624869Z * [new tag] viable/strict/1762860891 -> viable/strict/1762860891 2025-12-04T09:43:33.1626851Z * [new tag] viable/strict/1762866174 -> viable/strict/1762866174 2025-12-04T09:43:33.1628468Z * [new tag] viable/strict/1762867653 -> viable/strict/1762867653 
2025-12-04T09:43:33.1629887Z * [new tag] viable/strict/1762872669 -> viable/strict/1762872669 2025-12-04T09:43:33.1631177Z * [new tag] viable/strict/1762878380 -> viable/strict/1762878380 2025-12-04T09:43:33.1632674Z * [new tag] viable/strict/1762889003 -> viable/strict/1762889003 2025-12-04T09:43:33.1634236Z * [new tag] viable/strict/1762890589 -> viable/strict/1762890589 2025-12-04T09:43:33.1635664Z * [new tag] viable/strict/1762892743 -> viable/strict/1762892743 2025-12-04T09:43:33.1637165Z * [new tag] viable/strict/1762894271 -> viable/strict/1762894271 2025-12-04T09:43:33.1638510Z * [new tag] viable/strict/1762896287 -> viable/strict/1762896287 2025-12-04T09:43:33.1640025Z * [new tag] viable/strict/1762915871 -> viable/strict/1762915871 2025-12-04T09:43:33.1641549Z * [new tag] viable/strict/1762918569 -> viable/strict/1762918569 2025-12-04T09:43:33.1642896Z * [new tag] viable/strict/1762919776 -> viable/strict/1762919776 2025-12-04T09:43:33.1644376Z * [new tag] viable/strict/1762923072 -> viable/strict/1762923072 2025-12-04T09:43:33.1646175Z * [new tag] viable/strict/1762928826 -> viable/strict/1762928826 2025-12-04T09:43:33.1647663Z * [new tag] viable/strict/1762930451 -> viable/strict/1762930451 2025-12-04T09:43:33.1649125Z * [new tag] viable/strict/1762933780 -> viable/strict/1762933780 2025-12-04T09:43:33.1650592Z * [new tag] viable/strict/1762937638 -> viable/strict/1762937638 2025-12-04T09:43:33.1652205Z * [new tag] viable/strict/1762939545 -> viable/strict/1762939545 2025-12-04T09:43:33.1653733Z * [new tag] viable/strict/1762962692 -> viable/strict/1762962692 2025-12-04T09:43:33.1655187Z * [new tag] viable/strict/1762979143 -> viable/strict/1762979143 2025-12-04T09:43:33.1657054Z * [new tag] viable/strict/1762984188 -> viable/strict/1762984188 2025-12-04T09:43:33.1658427Z * [new tag] viable/strict/1762986306 -> viable/strict/1762986306 2025-12-04T09:43:33.1659918Z * [new tag] viable/strict/1762989903 -> viable/strict/1762989903 
2025-12-04T09:43:33.1661383Z * [new tag] viable/strict/1762991377 -> viable/strict/1762991377 2025-12-04T09:43:33.1662850Z * [new tag] viable/strict/1762998921 -> viable/strict/1762998921 2025-12-04T09:43:33.1664420Z * [new tag] viable/strict/1763002287 -> viable/strict/1763002287 2025-12-04T09:43:33.1665904Z * [new tag] viable/strict/1763016840 -> viable/strict/1763016840 2025-12-04T09:43:33.1667532Z * [new tag] viable/strict/1763020180 -> viable/strict/1763020180 2025-12-04T09:43:33.1669094Z * [new tag] viable/strict/1763027421 -> viable/strict/1763027421 2025-12-04T09:43:33.1670617Z * [new tag] viable/strict/1763031120 -> viable/strict/1763031120 2025-12-04T09:43:33.1672497Z * [new tag] viable/strict/1763036861 -> viable/strict/1763036861 2025-12-04T09:43:33.1674008Z * [new tag] viable/strict/1763038993 -> viable/strict/1763038993 2025-12-04T09:43:33.1675545Z * [new tag] viable/strict/1763054703 -> viable/strict/1763054703 2025-12-04T09:43:33.1676874Z * [new tag] viable/strict/1763067061 -> viable/strict/1763067061 2025-12-04T09:43:33.1678369Z * [new tag] viable/strict/1763070847 -> viable/strict/1763070847 2025-12-04T09:43:33.1679888Z * [new tag] viable/strict/1763072706 -> viable/strict/1763072706 2025-12-04T09:43:33.1681399Z * [new tag] viable/strict/1763076302 -> viable/strict/1763076302 2025-12-04T09:43:33.1682912Z * [new tag] viable/strict/1763080816 -> viable/strict/1763080816 2025-12-04T09:43:33.1684372Z * [new tag] viable/strict/1763082732 -> viable/strict/1763082732 2025-12-04T09:43:33.1685805Z * [new tag] viable/strict/1763085329 -> viable/strict/1763085329 2025-12-04T09:43:33.1687301Z * [new tag] viable/strict/1763088623 -> viable/strict/1763088623 2025-12-04T09:43:33.1688868Z * [new tag] viable/strict/1763091402 -> viable/strict/1763091402 2025-12-04T09:43:33.1690365Z * [new tag] viable/strict/1763092602 -> viable/strict/1763092602 2025-12-04T09:43:33.1691844Z * [new tag] viable/strict/1763094355 -> viable/strict/1763094355 
2025-12-04T09:43:33.1693338Z * [new tag] viable/strict/1763099390 -> viable/strict/1763099390 2025-12-04T09:43:33.1694839Z * [new tag] viable/strict/1763101608 -> viable/strict/1763101608 2025-12-04T09:43:33.1696306Z * [new tag] viable/strict/1763105102 -> viable/strict/1763105102 2025-12-04T09:43:33.1697836Z * [new tag] viable/strict/1763112347 -> viable/strict/1763112347 2025-12-04T09:43:33.1699292Z * [new tag] viable/strict/1763119471 -> viable/strict/1763119471 2025-12-04T09:43:33.1700820Z * [new tag] viable/strict/1763126835 -> viable/strict/1763126835 2025-12-04T09:43:33.1702046Z * [new tag] viable/strict/1763149779 -> viable/strict/1763149779 2025-12-04T09:43:33.1703510Z * [new tag] viable/strict/1763164178 -> viable/strict/1763164178 2025-12-04T09:43:33.1705068Z * [new tag] viable/strict/1763167104 -> viable/strict/1763167104 2025-12-04T09:43:33.1706496Z * [new tag] viable/strict/1763169132 -> viable/strict/1763169132 2025-12-04T09:43:33.1708114Z * [new tag] viable/strict/1763171708 -> viable/strict/1763171708 2025-12-04T09:43:33.1709565Z * [new tag] viable/strict/1763174759 -> viable/strict/1763174759 2025-12-04T09:43:33.1711118Z * [new tag] viable/strict/1763180744 -> viable/strict/1763180744 2025-12-04T09:43:33.1712547Z * [new tag] viable/strict/1763182227 -> viable/strict/1763182227 2025-12-04T09:43:33.1713993Z * [new tag] viable/strict/1763184309 -> viable/strict/1763184309 2025-12-04T09:43:33.1715847Z * [new tag] viable/strict/1763187991 -> viable/strict/1763187991 2025-12-04T09:43:33.1717349Z * [new tag] viable/strict/1763191445 -> viable/strict/1763191445 2025-12-04T09:43:33.1718957Z * [new tag] viable/strict/1763195152 -> viable/strict/1763195152 2025-12-04T09:43:33.1720294Z * [new tag] viable/strict/1763205769 -> viable/strict/1763205769 2025-12-04T09:43:33.1721771Z * [new tag] viable/strict/1763246990 -> viable/strict/1763246990 2025-12-04T09:43:33.1723329Z * [new tag] viable/strict/1763261578 -> viable/strict/1763261578 
2025-12-04T09:43:33.1724701Z * [new tag] viable/strict/1763286573 -> viable/strict/1763286573 2025-12-04T09:43:33.1726037Z * [new tag] viable/strict/1763292167 -> viable/strict/1763292167 2025-12-04T09:43:33.1727515Z * [new tag] viable/strict/1763333386 -> viable/strict/1763333386 2025-12-04T09:43:33.1729028Z * [new tag] viable/strict/1763340082 -> viable/strict/1763340082 2025-12-04T09:43:33.1731061Z * [new tag] viable/strict/1763364324 -> viable/strict/1763364324 2025-12-04T09:43:33.1732510Z * [new tag] viable/strict/1763371569 -> viable/strict/1763371569 2025-12-04T09:43:33.1733972Z * [new tag] viable/strict/1763373067 -> viable/strict/1763373067 2025-12-04T09:43:33.1735445Z * [new tag] viable/strict/1763375157 -> viable/strict/1763375157 2025-12-04T09:43:33.1736905Z * [new tag] viable/strict/1763382462 -> viable/strict/1763382462 2025-12-04T09:43:33.1738528Z * [new tag] viable/strict/1763394661 -> viable/strict/1763394661 2025-12-04T09:43:33.1740097Z * [new tag] viable/strict/1763396797 -> viable/strict/1763396797 2025-12-04T09:43:33.1741660Z * [new tag] viable/strict/1763398542 -> viable/strict/1763398542 2025-12-04T09:43:33.1743095Z * [new tag] viable/strict/1763401807 -> viable/strict/1763401807 2025-12-04T09:43:33.1744455Z * [new tag] viable/strict/1763414698 -> viable/strict/1763414698 2025-12-04T09:43:33.1745919Z * [new tag] viable/strict/1763419807 -> viable/strict/1763419807 2025-12-04T09:43:33.1747501Z * [new tag] viable/strict/1763426369 -> viable/strict/1763426369 2025-12-04T09:43:33.1749060Z * [new tag] viable/strict/1763428331 -> viable/strict/1763428331 2025-12-04T09:43:33.1750531Z * [new tag] viable/strict/1763430922 -> viable/strict/1763430922 2025-12-04T09:43:33.1751892Z * [new tag] viable/strict/1763434184 -> viable/strict/1763434184 2025-12-04T09:43:33.1753331Z * [new tag] viable/strict/1763439973 -> viable/strict/1763439973 2025-12-04T09:43:33.1755010Z * [new tag] viable/strict/1763444995 -> viable/strict/1763444995 
2025-12-04T09:43:33.1756719Z * [new tag] viable/strict/1763447206 -> viable/strict/1763447206 2025-12-04T09:43:33.1758242Z * [new tag] viable/strict/1763448826 -> viable/strict/1763448826 2025-12-04T09:43:33.1759704Z * [new tag] viable/strict/1763450717 -> viable/strict/1763450717 2025-12-04T09:43:33.1761149Z * [new tag] viable/strict/1763452183 -> viable/strict/1763452183 2025-12-04T09:43:33.1762713Z * [new tag] viable/strict/1763457945 -> viable/strict/1763457945 2025-12-04T09:43:33.1764175Z * [new tag] viable/strict/1763459439 -> viable/strict/1763459439 2025-12-04T09:43:33.1765563Z * [new tag] viable/strict/1763461556 -> viable/strict/1763461556 2025-12-04T09:43:33.1767010Z * [new tag] viable/strict/1763463103 -> viable/strict/1763463103 2025-12-04T09:43:33.1768925Z * [new tag] viable/strict/1763465100 -> viable/strict/1763465100 2025-12-04T09:43:33.1770283Z * [new tag] viable/strict/1763468866 -> viable/strict/1763468866 2025-12-04T09:43:33.1771616Z * [new tag] viable/strict/1763493823 -> viable/strict/1763493823 2025-12-04T09:43:33.1772970Z * [new tag] viable/strict/1763496249 -> viable/strict/1763496249 2025-12-04T09:43:33.1774422Z * [new tag] viable/strict/1763502620 -> viable/strict/1763502620 2025-12-04T09:43:33.1775958Z * [new tag] viable/strict/1763504715 -> viable/strict/1763504715 2025-12-04T09:43:33.1777439Z * [new tag] viable/strict/1763506208 -> viable/strict/1763506208 2025-12-04T09:43:33.1778918Z * [new tag] viable/strict/1763520590 -> viable/strict/1763520590 2025-12-04T09:43:33.1780468Z * [new tag] viable/strict/1763523357 -> viable/strict/1763523357 2025-12-04T09:43:33.1781938Z * [new tag] viable/strict/1763529922 -> viable/strict/1763529922 2025-12-04T09:43:33.1783492Z * [new tag] viable/strict/1763531408 -> viable/strict/1763531408 2025-12-04T09:43:33.1784949Z * [new tag] viable/strict/1763533622 -> viable/strict/1763533622 2025-12-04T09:43:33.1786429Z * [new tag] viable/strict/1763538576 -> viable/strict/1763538576 
2025-12-04T09:43:33.1788135Z * [new tag] viable/strict/1763545823 -> viable/strict/1763545823 2025-12-04T09:43:33.1789420Z * [new tag] viable/strict/1763547951 -> viable/strict/1763547951 2025-12-04T09:43:33.1790947Z * [new tag] viable/strict/1763551477 -> viable/strict/1763551477 2025-12-04T09:43:33.1792410Z * [new tag] viable/strict/1763552982 -> viable/strict/1763552982 2025-12-04T09:43:33.1793989Z * [new tag] viable/strict/1763594698 -> viable/strict/1763594698 2025-12-04T09:43:33.1795386Z * [new tag] viable/strict/1763596178 -> viable/strict/1763596178 2025-12-04T09:43:33.1796871Z * [new tag] viable/strict/1763599155 -> viable/strict/1763599155 2025-12-04T09:43:33.1798362Z * [new tag] viable/strict/1763603717 -> viable/strict/1763603717 2025-12-04T09:43:33.1799882Z * [new tag] viable/strict/1763606923 -> viable/strict/1763606923 2025-12-04T09:43:33.1801370Z * [new tag] viable/strict/1763609715 -> viable/strict/1763609715 2025-12-04T09:43:33.1802909Z * [new tag] viable/strict/1763612757 -> viable/strict/1763612757 2025-12-04T09:43:33.1804351Z * [new tag] viable/strict/1763616325 -> viable/strict/1763616325 2025-12-04T09:43:33.1805837Z * [new tag] viable/strict/1763623509 -> viable/strict/1763623509 2025-12-04T09:43:33.1807391Z * [new tag] viable/strict/1763624984 -> viable/strict/1763624984 2025-12-04T09:43:33.1809036Z * [new tag] viable/strict/1763628796 -> viable/strict/1763628796 2025-12-04T09:43:33.1810411Z * [new tag] viable/strict/1763634343 -> viable/strict/1763634343 2025-12-04T09:43:33.1811878Z * [new tag] viable/strict/1763635867 -> viable/strict/1763635867 2025-12-04T09:43:33.1813515Z * [new tag] viable/strict/1763639382 -> viable/strict/1763639382 2025-12-04T09:43:33.1815000Z * [new tag] viable/strict/1763646626 -> viable/strict/1763646626 2025-12-04T09:43:33.1816588Z * [new tag] viable/strict/1763655997 -> viable/strict/1763655997 2025-12-04T09:43:33.1818086Z * [new tag] viable/strict/1763659444 -> viable/strict/1763659444 
2025-12-04T09:43:33.1819544Z * [new tag] viable/strict/1763660992 -> viable/strict/1763660992 2025-12-04T09:43:33.1820976Z * [new tag] viable/strict/1763663201 -> viable/strict/1763663201 2025-12-04T09:43:33.1822538Z * [new tag] viable/strict/1763670362 -> viable/strict/1763670362 2025-12-04T09:43:33.1823860Z * [new tag] viable/strict/1763675378 -> viable/strict/1763675378 2025-12-04T09:43:33.1825314Z * [new tag] viable/strict/1763693343 -> viable/strict/1763693343 2025-12-04T09:43:33.1826789Z * [new tag] viable/strict/1763696088 -> viable/strict/1763696088 2025-12-04T09:43:33.1828504Z * [new tag] viable/strict/1763697343 -> viable/strict/1763697343 2025-12-04T09:43:33.1829957Z * [new tag] viable/strict/1763699165 -> viable/strict/1763699165 2025-12-04T09:43:33.1831385Z * [new tag] viable/strict/1763700660 -> viable/strict/1763700660 2025-12-04T09:43:33.1832837Z * [new tag] viable/strict/1763704209 -> viable/strict/1763704209 2025-12-04T09:43:33.1834309Z * [new tag] viable/strict/1763706411 -> viable/strict/1763706411 2025-12-04T09:43:33.1835782Z * [new tag] viable/strict/1763708082 -> viable/strict/1763708082 2025-12-04T09:43:33.1837179Z * [new tag] viable/strict/1763711381 -> viable/strict/1763711381 2025-12-04T09:43:33.1838558Z * [new tag] viable/strict/1763713593 -> viable/strict/1763713593 2025-12-04T09:43:33.1840018Z * [new tag] viable/strict/1763715201 -> viable/strict/1763715201 2025-12-04T09:43:33.1841478Z * [new tag] viable/strict/1763733017 -> viable/strict/1763733017 2025-12-04T09:43:33.1843011Z * [new tag] viable/strict/1763735108 -> viable/strict/1763735108 2025-12-04T09:43:33.1844444Z * [new tag] viable/strict/1763749579 -> viable/strict/1763749579 2025-12-04T09:43:33.1845870Z * [new tag] viable/strict/1763751113 -> viable/strict/1763751113 2025-12-04T09:43:33.1847352Z * [new tag] viable/strict/1763753035 -> viable/strict/1763753035 2025-12-04T09:43:33.1848920Z * [new tag] viable/strict/1763754578 -> viable/strict/1763754578 
2025-12-04T09:43:33.1850484Z * [new tag] viable/strict/1763756748 -> viable/strict/1763756748 2025-12-04T09:43:33.1851931Z * [new tag] viable/strict/1763758205 -> viable/strict/1763758205 2025-12-04T09:43:33.1853277Z * [new tag] viable/strict/1763764050 -> viable/strict/1763764050 2025-12-04T09:43:33.1854764Z * [new tag] viable/strict/1763771887 -> viable/strict/1763771887 2025-12-04T09:43:33.1858216Z * [new tag] viable/strict/1763773920 -> viable/strict/1763773920 2025-12-04T09:43:33.1859676Z * [new tag] viable/strict/1763776501 -> viable/strict/1763776501 2025-12-04T09:43:33.1861100Z * [new tag] viable/strict/1763779437 -> viable/strict/1763779437 2025-12-04T09:43:33.1862748Z * [new tag] viable/strict/1763781038 -> viable/strict/1763781038 2025-12-04T09:43:33.1864233Z * [new tag] viable/strict/1763782245 -> viable/strict/1763782245 2025-12-04T09:43:33.1866038Z * [new tag] viable/strict/1763785568 -> viable/strict/1763785568 2025-12-04T09:43:33.1867583Z * [new tag] viable/strict/1763787006 -> viable/strict/1763787006 2025-12-04T09:43:33.1869151Z * [new tag] viable/strict/1763789103 -> viable/strict/1763789103 2025-12-04T09:43:33.1870668Z * [new tag] viable/strict/1763790578 -> viable/strict/1763790578 2025-12-04T09:43:33.1872162Z * [new tag] viable/strict/1763796275 -> viable/strict/1763796275 2025-12-04T09:43:33.1873805Z * [new tag] viable/strict/1763801465 -> viable/strict/1763801465 2025-12-04T09:43:33.1875239Z * [new tag] viable/strict/1763803522 -> viable/strict/1763803522 2025-12-04T09:43:33.1876683Z * [new tag] viable/strict/1763808581 -> viable/strict/1763808581 2025-12-04T09:43:33.1878253Z * [new tag] viable/strict/1763840977 -> viable/strict/1763840977 2025-12-04T09:43:33.1879623Z * [new tag] viable/strict/1763846659 -> viable/strict/1763846659 2025-12-04T09:43:33.1881041Z * [new tag] viable/strict/1763872065 -> viable/strict/1763872065 2025-12-04T09:43:33.1882544Z * [new tag] viable/strict/1763873648 -> viable/strict/1763873648 
2025-12-04T09:43:33.1884077Z * [new tag] viable/strict/1763875506 -> viable/strict/1763875506 2025-12-04T09:43:33.1885382Z * [new tag] viable/strict/1763889904 -> viable/strict/1763889904 2025-12-04T09:43:33.1886877Z * [new tag] viable/strict/1763930999 -> viable/strict/1763930999 2025-12-04T09:43:33.1888356Z * [new tag] viable/strict/1763944964 -> viable/strict/1763944964 2025-12-04T09:43:33.1889755Z * [new tag] viable/strict/1763958474 -> viable/strict/1763958474 2025-12-04T09:43:33.1891221Z * [new tag] viable/strict/1763967263 -> viable/strict/1763967263 2025-12-04T09:43:33.1892696Z * [new tag] viable/strict/1763972803 -> viable/strict/1763972803 2025-12-04T09:43:33.1894121Z * [new tag] viable/strict/1763976376 -> viable/strict/1763976376 2025-12-04T09:43:33.1895660Z * [new tag] viable/strict/1763989404 -> viable/strict/1763989404 2025-12-04T09:43:33.1897052Z * [new tag] viable/strict/1763990887 -> viable/strict/1763990887 2025-12-04T09:43:33.1898532Z * [new tag] viable/strict/1764019919 -> viable/strict/1764019919 2025-12-04T09:43:33.1900138Z * [new tag] viable/strict/1764023134 -> viable/strict/1764023134 2025-12-04T09:43:33.1901485Z * [new tag] viable/strict/1764024593 -> viable/strict/1764024593 2025-12-04T09:43:33.1902936Z * [new tag] viable/strict/1764026706 -> viable/strict/1764026706 2025-12-04T09:43:33.1904609Z * [new tag] viable/strict/1764031139 -> viable/strict/1764031139 2025-12-04T09:43:33.1906142Z * [new tag] viable/strict/1764033131 -> viable/strict/1764033131 2025-12-04T09:43:33.1907521Z * [new tag] viable/strict/1764035725 -> viable/strict/1764035725 2025-12-04T09:43:33.1908877Z * [new tag] viable/strict/1764624265 -> viable/strict/1764624265 2025-12-04T09:43:33.1910212Z * [new tag] viable/strict/1764631514 -> viable/strict/1764631514 2025-12-04T09:43:33.1911532Z * [new tag] viable/strict/1764632987 -> viable/strict/1764632987 2025-12-04T09:43:33.1912857Z * [new tag] viable/strict/1764636063 -> viable/strict/1764636063 
2025-12-04T09:43:33.1914175Z * [new tag] viable/strict/1764643975 -> viable/strict/1764643975 2025-12-04T09:43:33.1915497Z * [new tag] viable/strict/1764646859 -> viable/strict/1764646859 2025-12-04T09:43:33.1916911Z * [new tag] viable/strict/1764653120 -> viable/strict/1764653120 2025-12-04T09:43:33.1918164Z * [new tag] viable/strict/1764654632 -> viable/strict/1764654632 2025-12-04T09:43:33.1919461Z * [new tag] viable/strict/1764656821 -> viable/strict/1764656821 2025-12-04T09:43:33.1920782Z * [new tag] viable/strict/1764658557 -> viable/strict/1764658557 2025-12-04T09:43:33.1922108Z * [new tag] viable/strict/1764660333 -> viable/strict/1764660333 2025-12-04T09:43:33.1923389Z * [new tag] viable/strict/1764661812 -> viable/strict/1764661812 2025-12-04T09:43:33.1924691Z * [new tag] viable/strict/1764664023 -> viable/strict/1764664023 2025-12-04T09:43:33.1926015Z * [new tag] viable/strict/1764669150 -> viable/strict/1764669150 2025-12-04T09:43:33.1927334Z * [new tag] viable/strict/1764680709 -> viable/strict/1764680709 2025-12-04T09:43:33.1928680Z * [new tag] viable/strict/1764687619 -> viable/strict/1764687619 2025-12-04T09:43:33.1929994Z * [new tag] viable/strict/1764696355 -> viable/strict/1764696355 2025-12-04T09:43:33.1931312Z * [new tag] viable/strict/1764701767 -> viable/strict/1764701767 2025-12-04T09:43:33.1932638Z * [new tag] viable/strict/1764710768 -> viable/strict/1764710768 2025-12-04T09:43:33.1933972Z * [new tag] viable/strict/1764716202 -> viable/strict/1764716202 2025-12-04T09:43:33.1935302Z * [new tag] viable/strict/1764793566 -> viable/strict/1764793566 2025-12-04T09:43:33.1936594Z * [new tag] viable/strict/1764797093 -> viable/strict/1764797093 2025-12-04T09:43:33.1937923Z * [new tag] viable/strict/1764800729 -> viable/strict/1764800729 2025-12-04T09:43:33.1939251Z * [new tag] whc_flight_1 -> whc_flight_1 2025-12-04T09:43:33.1940698Z * [new tag] whc_flight_2 -> whc_flight_2 2025-12-04T09:43:33.1942232Z * [new tag] whc_flight_4 -> whc_flight_4 
2025-12-04T09:43:33.2988988Z [command]/usr/bin/git rev-parse --verify --quiet ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32^{object} 2025-12-04T09:43:33.3015415Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:43:33.3019864Z ##[endgroup] 2025-12-04T09:43:33.3020257Z ##[group]Determining the checkout info 2025-12-04T09:43:33.3021444Z ##[endgroup] 2025-12-04T09:43:33.3025562Z [command]/usr/bin/git sparse-checkout disable 2025-12-04T09:43:33.3065020Z [command]/usr/bin/git config --local --unset-all extensions.worktreeConfig 2025-12-04T09:43:33.3093556Z ##[group]Checking out the ref 2025-12-04T09:43:33.3097329Z [command]/usr/bin/git checkout --progress --force ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:43:34.3337242Z Updating files: 71% (14374/20121) 2025-12-04T09:43:34.3423672Z Updating files: 72% (14488/20121) 2025-12-04T09:43:34.3626070Z Updating files: 73% (14689/20121) 2025-12-04T09:43:34.3872102Z Updating files: 74% (14890/20121) 2025-12-04T09:43:34.4346962Z Updating files: 75% (15091/20121) 2025-12-04T09:43:34.4510388Z Updating files: 76% (15292/20121) 2025-12-04T09:43:34.4668295Z Updating files: 77% (15494/20121) 2025-12-04T09:43:34.4887673Z Updating files: 78% (15695/20121) 2025-12-04T09:43:34.5154838Z Updating files: 79% (15896/20121) 2025-12-04T09:43:34.5473287Z Updating files: 80% (16097/20121) 2025-12-04T09:43:34.5762816Z Updating files: 81% (16299/20121) 2025-12-04T09:43:34.5994443Z Updating files: 82% (16500/20121) 2025-12-04T09:43:34.6172310Z Updating files: 83% (16701/20121) 2025-12-04T09:43:34.6335110Z Updating files: 84% (16902/20121) 2025-12-04T09:43:34.6519045Z Updating files: 85% (17103/20121) 2025-12-04T09:43:34.6696097Z Updating files: 86% (17305/20121) 2025-12-04T09:43:34.6859678Z Updating files: 87% (17506/20121) 2025-12-04T09:43:34.6998856Z Updating files: 88% (17707/20121) 2025-12-04T09:43:34.7158056Z Updating files: 89% (17908/20121) 2025-12-04T09:43:34.7349346Z Updating files: 90% (18109/20121) 2025-12-04T09:43:34.7493440Z 
Updating files: 91% (18311/20121) 2025-12-04T09:43:34.7669410Z Updating files: 92% (18512/20121) 2025-12-04T09:43:34.7872128Z Updating files: 93% (18713/20121) 2025-12-04T09:43:34.8090099Z Updating files: 94% (18914/20121) 2025-12-04T09:43:34.8286362Z Updating files: 95% (19115/20121) 2025-12-04T09:43:34.8468239Z Updating files: 96% (19317/20121) 2025-12-04T09:43:34.8650852Z Updating files: 97% (19518/20121) 2025-12-04T09:43:34.8936852Z Updating files: 98% (19719/20121) 2025-12-04T09:43:34.9130286Z Updating files: 99% (19920/20121) 2025-12-04T09:43:34.9130554Z Updating files: 100% (20121/20121) 2025-12-04T09:43:34.9130834Z Updating files: 100% (20121/20121), done. 2025-12-04T09:43:34.9359748Z Note: switching to 'ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32'. 2025-12-04T09:43:34.9360029Z 2025-12-04T09:43:34.9360279Z You are in 'detached HEAD' state. You can look around, make experimental 2025-12-04T09:43:34.9360751Z changes and commit them, and you can discard any commits you make in this 2025-12-04T09:43:34.9361222Z state without impacting any branches by switching back to a branch. 2025-12-04T09:43:34.9361498Z 2025-12-04T09:43:34.9361686Z If you want to create a new branch to retain commits you create, you may 2025-12-04T09:43:34.9362119Z do so (now or later) by using -c with the switch command. 
Example: 2025-12-04T09:43:34.9362400Z 2025-12-04T09:43:34.9362508Z git switch -c <new-branch-name> 2025-12-04T09:43:34.9362693Z 2025-12-04T09:43:34.9362792Z Or undo this operation with: 2025-12-04T09:43:34.9362951Z 2025-12-04T09:43:34.9363045Z git switch - 2025-12-04T09:43:34.9363175Z 2025-12-04T09:43:34.9363388Z Turn off this advice by setting config variable advice.detachedHead to false 2025-12-04T09:43:34.9363695Z 2025-12-04T09:43:34.9367450Z HEAD is now at ffd9b0fb435 Resolve collective autotuning test failure on arm (#168919) 2025-12-04T09:43:34.9491926Z ##[endgroup] 2025-12-04T09:43:34.9492330Z ##[group]Setting up auth for fetching submodules 2025-12-04T09:43:34.9498057Z [command]/usr/bin/git config --global http.https://github.com/.extraheader AUTHORIZATION: basic *** 2025-12-04T09:43:34.9553246Z [command]/usr/bin/git config --global --unset-all url.https://github.com/.insteadOf 2025-12-04T09:43:34.9582915Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf git@github.com: 2025-12-04T09:43:34.9611847Z [command]/usr/bin/git config --global --add url.https://github.com/.insteadOf org-21003710@github.com: 2025-12-04T09:43:34.9637261Z ##[endgroup] 2025-12-04T09:43:34.9637626Z ##[group]Fetching submodules 2025-12-04T09:43:34.9641464Z [command]/usr/bin/git submodule sync --recursive 2025-12-04T09:43:35.0011862Z [command]/usr/bin/git -c protocol.version=2 submodule update --init --force --recursive 2025-12-04T09:43:35.0373964Z Submodule 'android/libs/fbjni' (https://github.com/facebookincubator/fbjni.git) registered for path 'android/libs/fbjni' 2025-12-04T09:43:35.0376301Z Submodule 'third_party/NNPACK_deps/FP16' (https://github.com/Maratyszcza/FP16.git) registered for path 'third_party/FP16' 2025-12-04T09:43:35.0379903Z Submodule 'third_party/NNPACK_deps/FXdiv' (https://github.com/Maratyszcza/FXdiv.git) registered for path 'third_party/FXdiv' 2025-12-04T09:43:35.0383530Z Submodule 'third_party/NNPACK' (https://github.com/Maratyszcza/NNPACK.git)
registered for path 'third_party/NNPACK' 2025-12-04T09:43:35.0387414Z Submodule 'third_party/NVTX' (https://github.com/NVIDIA/NVTX.git) registered for path 'third_party/NVTX' 2025-12-04T09:43:35.0391890Z Submodule 'third_party/VulkanMemoryAllocator' (https://github.com/GPUOpen-LibrariesAndSDKs/VulkanMemoryAllocator.git) registered for path 'third_party/VulkanMemoryAllocator' 2025-12-04T09:43:35.0395593Z Submodule 'third_party/XNNPACK' (https://github.com/google/XNNPACK.git) registered for path 'third_party/XNNPACK' 2025-12-04T09:43:35.0399827Z Submodule 'third_party/aiter' (https://github.com/ROCm/aiter.git) registered for path 'third_party/aiter' 2025-12-04T09:43:35.0403761Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/benchmark' 2025-12-04T09:43:35.0408143Z Submodule 'third_party/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/composable_kernel' 2025-12-04T09:43:35.0412283Z Submodule 'third_party/cpp-httplib' (https://github.com/yhirose/cpp-httplib.git) registered for path 'third_party/cpp-httplib' 2025-12-04T09:43:35.0416641Z Submodule 'third_party/cpuinfo' (https://github.com/pytorch/cpuinfo.git) registered for path 'third_party/cpuinfo' 2025-12-04T09:43:35.0421333Z Submodule 'third_party/cudnn_frontend' (https://github.com/NVIDIA/cudnn-frontend.git) registered for path 'third_party/cudnn_frontend' 2025-12-04T09:43:35.0425948Z Submodule 'third_party/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/cutlass' 2025-12-04T09:43:35.0430685Z Submodule 'third_party/fbgemm' (https://github.com/pytorch/fbgemm) registered for path 'third_party/fbgemm' 2025-12-04T09:43:35.0436168Z Submodule 'third_party/flash-attention' (https://github.com/Dao-AILab/flash-attention.git) registered for path 'third_party/flash-attention' 2025-12-04T09:43:35.0443800Z Submodule 'third_party/flatbuffers' (https://github.com/google/flatbuffers.git) 
registered for path 'third_party/flatbuffers' 2025-12-04T09:43:35.0448727Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/fmt' 2025-12-04T09:43:35.0453637Z Submodule 'third_party/gemmlowp/gemmlowp' (https://github.com/google/gemmlowp.git) registered for path 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:43:35.0459023Z Submodule 'third_party/gloo' (https://github.com/pytorch/gloo) registered for path 'third_party/gloo' 2025-12-04T09:43:35.0464152Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/googletest' 2025-12-04T09:43:35.0469418Z Submodule 'third_party/ideep' (https://github.com/intel/ideep) registered for path 'third_party/ideep' 2025-12-04T09:43:35.0474806Z Submodule 'third_party/ittapi' (https://github.com/intel/ittapi.git) registered for path 'third_party/ittapi' 2025-12-04T09:43:35.0480097Z Submodule 'third_party/kineto' (https://github.com/pytorch/kineto) registered for path 'third_party/kineto' 2025-12-04T09:43:35.0485617Z Submodule 'third_party/kleidiai' (https://github.com/ARM-software/kleidiai.git) registered for path 'third_party/kleidiai' 2025-12-04T09:43:35.0491071Z Submodule 'third_party/mimalloc' (https://github.com/microsoft/mimalloc.git) registered for path 'third_party/mimalloc' 2025-12-04T09:43:35.0496721Z Submodule 'third_party/nlohmann' (https://github.com/nlohmann/json.git) registered for path 'third_party/nlohmann' 2025-12-04T09:43:35.0503932Z Submodule 'third_party/onnx' (https://github.com/onnx/onnx.git) registered for path 'third_party/onnx' 2025-12-04T09:43:35.0508126Z Submodule 'third_party/opentelemetry-cpp' (https://github.com/open-telemetry/opentelemetry-cpp.git) registered for path 'third_party/opentelemetry-cpp' 2025-12-04T09:43:35.0513782Z Submodule 'third_party/pocketfft' (https://github.com/mreineck/pocketfft) registered for path 'third_party/pocketfft' 2025-12-04T09:43:35.0519599Z Submodule 'third_party/protobuf' 
(https://github.com/protocolbuffers/protobuf.git) registered for path 'third_party/protobuf' 2025-12-04T09:43:35.0525690Z Submodule 'third_party/NNPACK_deps/psimd' (https://github.com/Maratyszcza/psimd.git) registered for path 'third_party/psimd' 2025-12-04T09:43:35.0532003Z Submodule 'third_party/NNPACK_deps/pthreadpool' (https://github.com/Maratyszcza/pthreadpool.git) registered for path 'third_party/pthreadpool' 2025-12-04T09:43:35.0541081Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/pybind11' 2025-12-04T09:43:35.0547602Z Submodule 'third_party/python-peachpy' (https://github.com/malfet/PeachPy.git) registered for path 'third_party/python-peachpy' 2025-12-04T09:43:35.0553569Z Submodule 'third_party/sleef' (https://github.com/shibatch/sleef) registered for path 'third_party/sleef' 2025-12-04T09:43:35.0560344Z Submodule 'third_party/tensorpipe' (https://github.com/pytorch/tensorpipe.git) registered for path 'third_party/tensorpipe' 2025-12-04T09:43:35.0596206Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/android/libs/fbjni'... 2025-12-04T09:43:35.2818358Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FP16'... 2025-12-04T09:43:35.2819278Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/FXdiv'... 2025-12-04T09:43:35.2859106Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flatbuffers'... 2025-12-04T09:43:38.0323513Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NNPACK'... 2025-12-04T09:43:38.0324482Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/benchmark'... 2025-12-04T09:43:38.0325364Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/NVTX'... 2025-12-04T09:43:38.0326205Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gloo'... 
2025-12-04T09:43:38.0327098Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention'... 2025-12-04T09:43:38.0328032Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpuinfo'... 2025-12-04T09:43:38.0329598Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/gemmlowp/gemmlowp'... 2025-12-04T09:43:38.0330507Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep'... 2025-12-04T09:43:38.0331634Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cpp-httplib'... 2025-12-04T09:43:38.0332646Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ittapi'... 2025-12-04T09:43:38.0333510Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kleidiai'... 2025-12-04T09:43:38.0334408Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pocketfft'... 2025-12-04T09:43:38.0335418Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cudnn_frontend'... 2025-12-04T09:43:38.0336321Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/psimd'... 2025-12-04T09:43:38.0337258Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/googletest'... 2025-12-04T09:43:38.0338162Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pthreadpool'... 2025-12-04T09:43:38.0339145Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/VulkanMemoryAllocator'... 2025-12-04T09:43:38.0340383Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/mimalloc'... 2025-12-04T09:43:38.0341667Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fmt'... 2025-12-04T09:43:38.1172705Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp'... 
2025-12-04T09:43:50.6172063Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/python-peachpy'... 2025-12-04T09:43:50.6173062Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto'... 2025-12-04T09:43:50.6173970Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe'... 2025-12-04T09:43:50.6174847Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/pybind11'... 2025-12-04T09:43:50.6175709Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/sleef'... 2025-12-04T09:43:50.6176569Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm'... 2025-12-04T09:43:50.6177410Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/cutlass'... 2025-12-04T09:43:50.6178555Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx'... 2025-12-04T09:43:50.6179464Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/composable_kernel'... 2025-12-04T09:43:50.6180259Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/nlohmann'... 2025-12-04T09:43:50.7172856Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/XNNPACK'... 2025-12-04T09:43:57.0513470Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter'... 2025-12-04T09:43:57.0514077Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf'... 
2025-12-04T09:43:57.0713454Z Submodule path 'android/libs/fbjni': checked out '7e1e1fe3858c63c251c637ae41a20de425dde96f' 2025-12-04T09:43:57.0868308Z Submodule path 'third_party/FP16': checked out '4dfe081cf6bcd15db339cf2680b9281b8451eeb3' 2025-12-04T09:43:57.0999485Z Submodule path 'third_party/FXdiv': checked out 'b408327ac2a15ec3e43352421954f5b1967701d1' 2025-12-04T09:43:57.1324105Z Submodule path 'third_party/NNPACK': checked out 'c07e3a0400713d546e0dea2d5466dd22ea389c73' 2025-12-04T09:43:57.2277873Z Submodule path 'third_party/NVTX': checked out '3ebbc93ded7285963bff932c678fa367eb393ba6' 2025-12-04T09:43:57.2890710Z Submodule path 'third_party/VulkanMemoryAllocator': checked out '1d8f600fd424278486eade7ed3e877c99f0846b1' 2025-12-04T09:43:58.2143188Z Submodule path 'third_party/XNNPACK': checked out '51a0103656eff6fc9bfd39a4597923c4b542c883' 2025-12-04T09:43:58.4142689Z Submodule path 'third_party/aiter': checked out '01aae101b9e5e94d6c16a9514c9fb8df99c93150' 2025-12-04T09:43:58.4167650Z Submodule '3rdparty/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:43:58.4201733Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/aiter/3rdparty/composable_kernel'... 
2025-12-04T09:44:02.7627718Z Submodule path 'third_party/aiter/3rdparty/composable_kernel': checked out 'cffe8fa2a442ac8e80dd236a1a5d24fe3d7e0cbf' 2025-12-04T09:44:02.7951027Z Submodule path 'third_party/benchmark': checked out '299e5928955cc62af9968370293b916f5130916f' 2025-12-04T09:44:03.2304433Z Submodule path 'third_party/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T09:44:03.2862283Z Submodule path 'third_party/cpp-httplib': checked out '89c932f313c6437c38f2982869beacc89c2f2246' 2025-12-04T09:44:03.3918781Z Submodule path 'third_party/cpuinfo': checked out 'f858c30bcb16f8effd5ff46996f0514539e17abc' 2025-12-04T09:44:03.4463428Z Submodule path 'third_party/cudnn_frontend': checked out '0b1577c8c83401237d601d0d0db5210506705396' 2025-12-04T09:44:04.2005906Z Submodule path 'third_party/cutlass': checked out 'f88806b1e31dfa579842638740216dd41fc6c588' 2025-12-04T09:44:04.3873760Z Submodule path 'third_party/fbgemm': checked out 'c0b988d39a9e47c794d699f29930ed4d7c7e13a4' 2025-12-04T09:44:04.3902102Z Submodule 'external/asmjit' (https://github.com/asmjit/asmjit.git) registered for path 'third_party/fbgemm/external/asmjit' 2025-12-04T09:44:04.3905840Z Submodule 'external/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:44:04.3910079Z Submodule 'external/cpuinfo' (https://github.com/pytorch/cpuinfo) registered for path 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:44:04.3914257Z Submodule 'external/cutlass' (https://github.com/jwfromm/cutlass) registered for path 'third_party/fbgemm/external/cutlass' 2025-12-04T09:44:04.3918138Z Submodule 'external/googletest' (https://github.com/google/googletest) registered for path 'third_party/fbgemm/external/googletest' 2025-12-04T09:44:04.3922569Z Submodule 'external/hipify_torch' (https://github.com/ROCmSoftwarePlatform/hipify_torch.git) registered for path 
'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:44:04.3926964Z Submodule 'external/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/fbgemm/external/json' 2025-12-04T09:44:04.3961362Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/asmjit'... 2025-12-04T09:44:05.7344360Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/hipify_torch'... 2025-12-04T09:44:05.7345241Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cpuinfo'... 2025-12-04T09:44:05.7346356Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/googletest'... 2025-12-04T09:44:05.8345917Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/composable_kernel'... 2025-12-04T09:44:08.7035810Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/cutlass'... 2025-12-04T09:44:08.8036844Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/fbgemm/external/json'... 
2025-12-04T09:44:10.8262111Z Submodule path 'third_party/fbgemm/external/asmjit': checked out 'a3199e8857792cd10b7589ff5d58343d2c9008ea' 2025-12-04T09:44:11.2688519Z Submodule path 'third_party/fbgemm/external/composable_kernel': checked out '7fe50dc3da2069d6645d9deb8c017a876472a977' 2025-12-04T09:44:11.3761606Z Submodule path 'third_party/fbgemm/external/cpuinfo': checked out '6543fec09b2f04ac4a666882998b534afc9c1349' 2025-12-04T09:44:12.1088537Z Submodule path 'third_party/fbgemm/external/cutlass': checked out '98125ce499b0fdf7ffbe0e3052f5b8709f4840f8' 2025-12-04T09:44:12.1603067Z Submodule path 'third_party/fbgemm/external/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:44:12.1753436Z Submodule path 'third_party/fbgemm/external/hipify_torch': checked out '63b6a7b541fa7f08f8475ca7d74054db36ff2691' 2025-12-04T09:44:12.2968939Z Submodule path 'third_party/fbgemm/external/json': checked out '9cca280a4d0ccf0c08f47a99aa71d1b0e52f8d03' 2025-12-04T09:44:12.3841852Z Submodule path 'third_party/flash-attention': checked out '979702c87a8713a8e0a5e9fee122b90d2ef13be5' 2025-12-04T09:44:12.3865414Z Submodule 'csrc/composable_kernel' (https://github.com/ROCm/composable_kernel.git) registered for path 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:44:12.3869189Z Submodule 'csrc/cutlass' (https://github.com/NVIDIA/cutlass.git) registered for path 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:44:12.3900977Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/composable_kernel'... 2025-12-04T09:44:16.3820454Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/flash-attention/csrc/cutlass'... 
2025-12-04T09:44:16.6936548Z Submodule path 'third_party/flash-attention/csrc/composable_kernel': checked out '888317e698e9803c62bd38568abc9e05d7709f33' 2025-12-04T09:44:17.3532503Z Submodule path 'third_party/flash-attention/csrc/cutlass': checked out 'c506e16788cb08416a4a57e11a9067beeee29420' 2025-12-04T09:44:17.5230730Z Submodule path 'third_party/flatbuffers': checked out 'a2cd1ea3b6d3fee220106b5fed3f7ce8da9eb757' 2025-12-04T09:44:17.5565669Z Submodule path 'third_party/fmt': checked out '407c905e45ad75fc29bf0f9bb7c5c2fd3475976f' 2025-12-04T09:44:17.6001560Z Submodule path 'third_party/gemmlowp/gemmlowp': checked out '3fb5c176c17c765a3492cd2f0321b0dab712f350' 2025-12-04T09:44:17.6336367Z Submodule path 'third_party/gloo': checked out '54cbae0d3a67fa890b4c3d9ee162b7860315e341' 2025-12-04T09:44:17.6831852Z Submodule path 'third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:44:17.6997676Z Submodule path 'third_party/ideep': checked out '719d8e6cd7f7a0e01b155657526d693acf97c2b3' 2025-12-04T09:44:17.7018838Z Submodule 'mkl-dnn' (https://github.com/intel/mkl-dnn.git) registered for path 'third_party/ideep/mkl-dnn' 2025-12-04T09:44:17.7049682Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/ideep/mkl-dnn'... 
2025-12-04T09:44:31.9608408Z Submodule path 'third_party/ideep/mkl-dnn': checked out '8d263e693366ef8db40acc569cc7d8edf644556d' 2025-12-04T09:44:31.9875957Z Submodule path 'third_party/ittapi': checked out 'dec1d23ca65ab069d225dfe40dea14f455170959' 2025-12-04T09:44:32.0763031Z Submodule path 'third_party/kineto': checked out '31f85df8fbd89c188f14ef10f1ec65379786b943' 2025-12-04T09:44:32.0785665Z Submodule 'libkineto/third_party/dynolog' (https://github.com/facebookincubator/dynolog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:44:32.0789453Z Submodule 'libkineto/third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:44:32.0793715Z Submodule 'libkineto/third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:44:32.0826461Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog'... 2025-12-04T09:44:33.0115050Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/fmt'... 2025-12-04T09:44:33.4353161Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/googletest'... 
2025-12-04T09:44:33.5314169Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog': checked out 'd2ffe0a4e3acace628db49974246b66fc3e85fb1' 2025-12-04T09:44:33.5336185Z Submodule 'third_party/DCGM' (https://github.com/NVIDIA/DCGM.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:44:33.5340181Z Submodule 'third_party/cpr' (https://github.com/libcpr/cpr.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:44:33.5344256Z Submodule 'third_party/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:44:33.5348465Z Submodule 'third_party/gflags' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:44:33.5352767Z Submodule 'third_party/glog' (https://github.com/google/glog.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:44:33.5357357Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:44:33.5361419Z Submodule 'third_party/json' (https://github.com/nlohmann/json.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:44:33.5365719Z Submodule 'third_party/pfs' (https://github.com/dtrugman/pfs.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:44:33.5370456Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:44:33.5405664Z Cloning into 
'/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM'... 2025-12-04T09:44:35.5756705Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/pfs'... 2025-12-04T09:44:35.5758059Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags'... 2025-12-04T09:44:35.5760147Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp'... 2025-12-04T09:44:35.5761108Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/cpr'... 2025-12-04T09:44:35.5762139Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/glog'... 2025-12-04T09:44:35.5763863Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/googletest'... 2025-12-04T09:44:35.5764662Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/fmt'... 2025-12-04T09:44:35.6757228Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/json'... 
2025-12-04T09:44:40.0204013Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM': checked out 'ffde4e54bc7249a6039a5e6b45b395141e1217f9' 2025-12-04T09:44:40.0441504Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr': checked out '871ed52d350214a034f6ef8a3b8f51c5ce1bd400' 2025-12-04T09:44:40.0854851Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt': checked out 'cd4af11efc9c622896a3e4cb599fa28668ca3d05' 2025-12-04T09:44:40.1033251Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags': checked out 'e171aa2d15ed9eb17054558e0b3a6a413bb01067' 2025-12-04T09:44:40.1055995Z Submodule 'doc' (https://github.com/gflags/gflags.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:44:40.1087378Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc'... 
2025-12-04T09:44:40.3787066Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc': checked out '8411df715cf522606e3b1aca386ddfc0b63d34b4' 2025-12-04T09:44:40.4022949Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog': checked out 'b33e3bad4c46c8a6345525fd822af355e5ef9446' 2025-12-04T09:44:40.4524589Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:44:40.5645145Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/json': checked out '4f8fba14066156b73f1189a2b8bd568bde5284c5' 2025-12-04T09:44:40.5857952Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs': checked out 'f68a2fa8ea36c783bdd760371411fcb495aa3150' 2025-12-04T09:44:40.6084314Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp': checked out 'b1234816facfdda29845c46696a02998a4af115a' 2025-12-04T09:44:40.6105698Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:40.6109832Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:40.6144082Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T09:44:42.9221439Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest'... 
2025-12-04T09:44:43.1797227Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'd7ba35bbb649209c66e582d5a0244ba988a15159' 2025-12-04T09:44:43.2327046Z Submodule path 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T09:44:43.2686891Z Submodule path 'third_party/kineto/libkineto/third_party/fmt': checked out '40626af88bd7df9a5fb80be7b25ac85b122d6c21' 2025-12-04T09:44:43.3189542Z Submodule path 'third_party/kineto/libkineto/third_party/googletest': checked out '52eb8108c5bdec04579160ae17225d66034bd723' 2025-12-04T09:44:43.3846556Z Submodule path 'third_party/kleidiai': checked out 'd7770c89632329a9914ef1a90289917597639cbe' 2025-12-04T09:44:43.4307900Z Submodule path 'third_party/mimalloc': checked out 'fbd8b99c2b828428947d70fdc046bb55609be93e' 2025-12-04T09:44:43.5534782Z Submodule path 'third_party/nlohmann': checked out '55f93686c01528224f448c19128836e7df245f72' 2025-12-04T09:44:44.1457985Z Submodule path 'third_party/onnx': checked out 'e709452ef2bbc1d113faf678c24e6d3467696e83' 2025-12-04T09:44:44.1497554Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/onnx/third_party/pybind11' 2025-12-04T09:44:44.1530479Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/onnx/third_party/pybind11'... 
2025-12-04T09:44:44.9221978Z Submodule path 'third_party/onnx/third_party/pybind11': checked out 'a2e59f0e7065404b44dfe92a28aca47ba1378dc4' 2025-12-04T09:44:45.0149061Z Submodule path 'third_party/opentelemetry-cpp': checked out 'a799f4aed9c94b765dcdaabaeab7d5e7e2310878' 2025-12-04T09:44:45.0174480Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark) registered for path 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:44:45.0178182Z Submodule 'third_party/googletest' (https://github.com/google/googletest) registered for path 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:44:45.0181917Z Submodule 'third_party/ms-gsl' (https://github.com/microsoft/GSL) registered for path 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:44:45.0185880Z Submodule 'third_party/nlohmann-json' (https://github.com/nlohmann/json) registered for path 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:44:45.0190182Z Submodule 'third_party/opentelemetry-proto' (https://github.com/open-telemetry/opentelemetry-proto) registered for path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:44:45.0194185Z Submodule 'third_party/opentracing-cpp' (https://github.com/opentracing/opentracing-cpp.git) registered for path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:44:45.0198292Z Submodule 'third_party/prometheus-cpp' (https://github.com/jupp0r/prometheus-cpp) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:44:45.0202370Z Submodule 'tools/vcpkg' (https://github.com/Microsoft/vcpkg) registered for path 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:44:45.0236987Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/benchmark'... 
2025-12-04T09:44:45.4253095Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentracing-cpp'... 2025-12-04T09:44:45.4254529Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/opentelemetry-proto'... 2025-12-04T09:44:45.4256124Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/ms-gsl'... 2025-12-04T09:44:45.4257392Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp'... 2025-12-04T09:44:45.5254697Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/googletest'... 2025-12-04T09:44:46.0062751Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/nlohmann-json'... 2025-12-04T09:44:51.7935171Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/tools/vcpkg'... 
2025-12-04T09:44:52.4779391Z Submodule path 'third_party/opentelemetry-cpp/third_party/benchmark': checked out 'd572f4777349d43653b21d6c2fc63020ab326db2' 2025-12-04T09:44:52.5232519Z Submodule path 'third_party/opentelemetry-cpp/third_party/googletest': checked out 'b796f7d44681514f58a683a3a71ff17c94edb0c1' 2025-12-04T09:44:52.5434390Z Submodule path 'third_party/opentelemetry-cpp/third_party/ms-gsl': checked out '6f4529395c5b7c2d661812257cd6780c67e54afa' 2025-12-04T09:44:52.6661321Z Submodule path 'third_party/opentelemetry-cpp/third_party/nlohmann-json': checked out 'bc889afb4c5bf1c0d8ee29ef35eaaf4c8bef8a5d' 2025-12-04T09:44:52.6830650Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto': checked out '4ca4f0335c63cda7ab31ea7ed70d6553aee14dce' 2025-12-04T09:44:52.7030098Z Submodule path 'third_party/opentelemetry-cpp/third_party/opentracing-cpp': checked out '06b57f48ded1fa3bdd3d4346f6ef29e40e08eaf5' 2025-12-04T09:44:52.7246423Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp': checked out 'c9ffcdda9086ffd9e1283ea7a0276d831f3c8a8d' 2025-12-04T09:44:52.7268067Z Submodule 'civetweb' (https://github.com/civetweb/civetweb.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:52.7271977Z Submodule 'googletest' (https://github.com/google/googletest.git) registered for path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:52.7304571Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb'... 2025-12-04T09:44:54.5897350Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest'... 
2025-12-04T09:44:54.8400548Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb': checked out 'eefb26f82b233268fc98577d265352720d477ba4' 2025-12-04T09:44:54.8922151Z Submodule path 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest': checked out 'e2239ee6043f73722e7aa812a459f54a28552929' 2025-12-04T09:44:55.5958136Z Submodule path 'third_party/opentelemetry-cpp/tools/vcpkg': checked out '8eb57355a4ffb410a2e94c07b4dca2dffbee8e50' 2025-12-04T09:44:55.6119648Z Submodule path 'third_party/pocketfft': checked out '0fa0ef591e38c2758e3184c6c23e497b9f732ffa' 2025-12-04T09:44:55.9177726Z Submodule path 'third_party/protobuf': checked out 'd1eca4e4b421cd2997495c4b4e65cea6be4e9b8a' 2025-12-04T09:44:55.9204774Z Submodule 'third_party/benchmark' (https://github.com/google/benchmark.git) registered for path 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:44:55.9208525Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/protobuf/third_party/googletest' 2025-12-04T09:44:55.9241852Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/benchmark'... 2025-12-04T09:44:56.4389134Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/protobuf/third_party/googletest'... 
2025-12-04T09:44:56.7564487Z Submodule path 'third_party/protobuf/third_party/benchmark': checked out '5b7683f49e1e9223cf9927b24f6fd3d6bd82e3f8' 2025-12-04T09:44:56.8317862Z Submodule path 'third_party/protobuf/third_party/googletest': checked out '5ec7f0c4a113e2f18ac2c6cc7df51ad6afc24081' 2025-12-04T09:44:56.8455734Z Submodule path 'third_party/psimd': checked out '072586a71b55b7f8c584153d223e95687148a900' 2025-12-04T09:44:56.8632988Z Submodule path 'third_party/pthreadpool': checked out '4fe0e1e183925bf8cfa6aae24237e724a96479b8' 2025-12-04T09:44:56.9156139Z Submodule path 'third_party/pybind11': checked out 'f5fbe867d2d26e4a0a9177a51f6e568868ad3dc8' 2025-12-04T09:44:56.9504618Z Submodule path 'third_party/python-peachpy': checked out 'f45429b087dd7d5bc78bb40dc7cf06425c252d67' 2025-12-04T09:44:56.9983212Z Submodule path 'third_party/sleef': checked out '5a1d179df9cf652951b59010a2d2075372d67f68' 2025-12-04T09:44:57.0343155Z Submodule path 'third_party/tensorpipe': checked out '2b4cd91092d335a697416b2a3cb398283246849d' 2025-12-04T09:44:57.0367024Z Submodule 'third_party/googletest' (https://github.com/google/googletest.git) registered for path 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:44:57.0370888Z Submodule 'third_party/libnop' (https://github.com/google/libnop.git) registered for path 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:44:57.0374419Z Submodule 'third_party/libuv' (https://github.com/libuv/libuv.git) registered for path 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:44:57.0378207Z Submodule 'third_party/pybind11' (https://github.com/pybind/pybind11.git) registered for path 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:44:57.0411578Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/googletest'... 2025-12-04T09:44:57.9932609Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libnop'... 
2025-12-04T09:44:57.9933502Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11'... 2025-12-04T09:44:58.0223776Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/libuv'... 2025-12-04T09:44:58.0819168Z Submodule path 'third_party/tensorpipe/third_party/googletest': checked out 'aee0f9d9b5b87796ee8a0ab26b7587ec30e8858e' 2025-12-04T09:44:58.1024688Z Submodule path 'third_party/tensorpipe/third_party/libnop': checked out '910b55815be16109f04f4180e9adee14fb4ce281' 2025-12-04T09:44:58.1823015Z Submodule path 'third_party/tensorpipe/third_party/libuv': checked out '5152db2cbfeb5582e9c27c5ea1dba2cd9e10759b' 2025-12-04T09:44:58.2163737Z Submodule path 'third_party/tensorpipe/third_party/pybind11': checked out 'a23996fce38ff6ccfbcdc09f1e63f2c4be5ea2ef' 2025-12-04T09:44:58.2185841Z Submodule 'tools/clang' (https://github.com/wjakob/clang-cindex-python3) registered for path 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:44:58.2223806Z Cloning into '/home/ec2-user/actions-runner/_work/pytorch/pytorch/third_party/tensorpipe/third_party/pybind11/tools/clang'... 
2025-12-04T09:44:58.4076504Z Submodule path 'third_party/tensorpipe/third_party/pybind11/tools/clang': checked out '6a00cbc4a9b8e68b71caf7f774b3f9c753ae84d5' 2025-12-04T09:44:58.4126518Z [command]/usr/bin/git submodule foreach --recursive git config --local gc.auto 0 2025-12-04T09:44:58.4486107Z Entering 'android/libs/fbjni' 2025-12-04T09:44:58.4539074Z Entering 'third_party/FP16' 2025-12-04T09:44:58.4593182Z Entering 'third_party/FXdiv' 2025-12-04T09:44:58.4645029Z Entering 'third_party/NNPACK' 2025-12-04T09:44:58.4698435Z Entering 'third_party/NVTX' 2025-12-04T09:44:58.4751883Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:44:58.4806945Z Entering 'third_party/XNNPACK' 2025-12-04T09:44:58.4871068Z Entering 'third_party/aiter' 2025-12-04T09:44:58.4921552Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:44:58.4980383Z Entering 'third_party/benchmark' 2025-12-04T09:44:58.5031606Z Entering 'third_party/composable_kernel' 2025-12-04T09:44:58.5093852Z Entering 'third_party/cpp-httplib' 2025-12-04T09:44:58.5146629Z Entering 'third_party/cpuinfo' 2025-12-04T09:44:58.5200578Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:44:58.5252113Z Entering 'third_party/cutlass' 2025-12-04T09:44:58.5313616Z Entering 'third_party/fbgemm' 2025-12-04T09:44:58.5367509Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:44:58.5417541Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:44:58.5479804Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:44:58.5531398Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:44:58.5589115Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:44:58.5639928Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:44:58.5690292Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:44:58.5745062Z Entering 'third_party/flash-attention' 2025-12-04T09:44:58.5798521Z Entering 'third_party/flash-attention/csrc/composable_kernel' 
2025-12-04T09:44:58.5855483Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:44:58.5918461Z Entering 'third_party/flatbuffers' 2025-12-04T09:44:58.5973758Z Entering 'third_party/fmt' 2025-12-04T09:44:58.6028306Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:44:58.6080901Z Entering 'third_party/gloo' 2025-12-04T09:44:58.6132119Z Entering 'third_party/googletest' 2025-12-04T09:44:58.6183051Z Entering 'third_party/ideep' 2025-12-04T09:44:58.6249758Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:44:58.6295695Z Entering 'third_party/ittapi' 2025-12-04T09:44:58.6348370Z Entering 'third_party/kineto' 2025-12-04T09:44:58.6402680Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:44:58.6451244Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:44:58.6505402Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:44:58.6558740Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:44:58.6610235Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:44:58.6659872Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:44:58.6714109Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:44:58.6767177Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:44:58.6818744Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:44:58.6870667Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:44:58.6920581Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:44:58.6973700Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:58.7029598Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:58.7085835Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:44:58.7140254Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:44:58.7194181Z Entering 'third_party/kleidiai' 2025-12-04T09:44:58.7247851Z Entering 'third_party/mimalloc' 2025-12-04T09:44:58.7301347Z Entering 'third_party/nlohmann' 2025-12-04T09:44:58.7353413Z Entering 'third_party/onnx' 2025-12-04T09:44:58.7420970Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:44:58.7475314Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:44:58.7536580Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:44:58.7589057Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:44:58.7639798Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:44:58.7690475Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:44:58.7741775Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:44:58.7794058Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:44:58.7844205Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:44:58.7895215Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:58.7949538Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:58.8003707Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:44:58.8075194Z Entering 'third_party/pocketfft' 2025-12-04T09:44:58.8127509Z Entering 'third_party/protobuf' 2025-12-04T09:44:58.8181802Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:44:58.8234871Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:44:58.8290736Z Entering 'third_party/psimd' 
2025-12-04T09:44:58.8341085Z Entering 'third_party/pthreadpool' 2025-12-04T09:44:58.8395961Z Entering 'third_party/pybind11' 2025-12-04T09:44:58.8449445Z Entering 'third_party/python-peachpy' 2025-12-04T09:44:58.8503455Z Entering 'third_party/sleef' 2025-12-04T09:44:58.8561852Z Entering 'third_party/tensorpipe' 2025-12-04T09:44:58.8611755Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:44:58.8663645Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:44:58.8715944Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:44:58.8767849Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:44:58.8816091Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:44:58.8888086Z ##[endgroup] 2025-12-04T09:44:58.8888828Z ##[group]Persisting credentials for submodules 2025-12-04T09:44:58.8894522Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'url\.https\:\/\/github\.com\/\.insteadOf' && git config --local --unset-all 'url.https://github.com/.insteadOf' || :" 2025-12-04T09:44:58.9248321Z Entering 'android/libs/fbjni' 2025-12-04T09:44:58.9325494Z Entering 'third_party/FP16' 2025-12-04T09:44:58.9404696Z Entering 'third_party/FXdiv' 2025-12-04T09:44:58.9474349Z Entering 'third_party/NNPACK' 2025-12-04T09:44:58.9548459Z Entering 'third_party/NVTX' 2025-12-04T09:44:58.9615080Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:44:58.9689837Z Entering 'third_party/XNNPACK' 2025-12-04T09:44:58.9772725Z Entering 'third_party/aiter' 2025-12-04T09:44:58.9840938Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:44:58.9917520Z Entering 'third_party/benchmark' 2025-12-04T09:44:58.9989974Z Entering 'third_party/composable_kernel' 2025-12-04T09:44:59.0066474Z Entering 'third_party/cpp-httplib' 2025-12-04T09:44:59.0135210Z Entering 'third_party/cpuinfo' 2025-12-04T09:44:59.0206953Z Entering 
'third_party/cudnn_frontend' 2025-12-04T09:44:59.0277963Z Entering 'third_party/cutlass' 2025-12-04T09:44:59.0356807Z Entering 'third_party/fbgemm' 2025-12-04T09:44:59.0435837Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:44:59.0512319Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:44:59.0587533Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:44:59.0657772Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:44:59.0735024Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:44:59.0802452Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:44:59.0869780Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:44:59.0942600Z Entering 'third_party/flash-attention' 2025-12-04T09:44:59.1010118Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:44:59.1083521Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:44:59.1160091Z Entering 'third_party/flatbuffers' 2025-12-04T09:44:59.1231428Z Entering 'third_party/fmt' 2025-12-04T09:44:59.1300257Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:44:59.1370035Z Entering 'third_party/gloo' 2025-12-04T09:44:59.1445078Z Entering 'third_party/googletest' 2025-12-04T09:44:59.1521556Z Entering 'third_party/ideep' 2025-12-04T09:44:59.1587804Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:44:59.1665537Z Entering 'third_party/ittapi' 2025-12-04T09:44:59.1735962Z Entering 'third_party/kineto' 2025-12-04T09:44:59.1809339Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:44:59.1877252Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:44:59.1952321Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:44:59.2020975Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:44:59.2090465Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 
2025-12-04T09:44:59.2158438Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:44:59.2236554Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:44:59.2312131Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:44:59.2380964Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:44:59.2450816Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:44:59.2519855Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:44:59.2587813Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:59.2662308Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:59.2737136Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:44:59.2808182Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:44:59.2880157Z Entering 'third_party/kleidiai' 2025-12-04T09:44:59.2949650Z Entering 'third_party/mimalloc' 2025-12-04T09:44:59.3019322Z Entering 'third_party/nlohmann' 2025-12-04T09:44:59.3091143Z Entering 'third_party/onnx' 2025-12-04T09:44:59.3174470Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:44:59.3257496Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:44:59.3333684Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:44:59.3405902Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:44:59.3480428Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:44:59.3548755Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:44:59.3619669Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:44:59.3687287Z 
Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:44:59.3758394Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:44:59.3824627Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:59.3897784Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:59.3974367Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:44:59.4070360Z Entering 'third_party/pocketfft' 2025-12-04T09:44:59.4138935Z Entering 'third_party/protobuf' 2025-12-04T09:44:59.4211325Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:44:59.4278151Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:44:59.4355975Z Entering 'third_party/psimd' 2025-12-04T09:44:59.4429010Z Entering 'third_party/pthreadpool' 2025-12-04T09:44:59.4500116Z Entering 'third_party/pybind11' 2025-12-04T09:44:59.4569732Z Entering 'third_party/python-peachpy' 2025-12-04T09:44:59.4639405Z Entering 'third_party/sleef' 2025-12-04T09:44:59.4709944Z Entering 'third_party/tensorpipe' 2025-12-04T09:44:59.4778623Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:44:59.4849131Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:44:59.4916551Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:44:59.4984308Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:44:59.5049177Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:44:59.5140625Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local 'http.https://github.com/.extraheader' 'AUTHORIZATION: basic ***' && git config --local --show-origin --name-only --get-regexp remote.origin.url" 2025-12-04T09:44:59.5494064Z Entering 'android/libs/fbjni' 2025-12-04T09:44:59.5557453Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T09:44:59.5578421Z Entering 'third_party/FP16' 2025-12-04T09:44:59.5644371Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T09:44:59.5667390Z Entering 'third_party/FXdiv' 2025-12-04T09:44:59.5733517Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T09:44:59.5756201Z Entering 'third_party/NNPACK' 2025-12-04T09:44:59.5828571Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T09:44:59.5849598Z Entering 'third_party/NVTX' 2025-12-04T09:44:59.5914650Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T09:44:59.5937476Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:44:59.6004696Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T09:44:59.6027499Z Entering 'third_party/XNNPACK' 2025-12-04T09:44:59.6093133Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T09:44:59.6128965Z Entering 'third_party/aiter' 2025-12-04T09:44:59.6193631Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T09:44:59.6215504Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:44:59.6281970Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T09:44:59.6313699Z Entering 'third_party/benchmark' 2025-12-04T09:44:59.6379443Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:44:59.6400546Z Entering 'third_party/composable_kernel' 2025-12-04T09:44:59.6465441Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T09:44:59.6494649Z Entering 'third_party/cpp-httplib' 2025-12-04T09:44:59.6558133Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T09:44:59.6579123Z Entering 'third_party/cpuinfo' 2025-12-04T09:44:59.6642271Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T09:44:59.6662857Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:44:59.6725483Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T09:44:59.6748753Z Entering 'third_party/cutlass' 2025-12-04T09:44:59.6811155Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T09:44:59.6842418Z Entering 'third_party/fbgemm' 2025-12-04T09:44:59.6906111Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T09:44:59.6929531Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:44:59.6994047Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T09:44:59.7015962Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:44:59.7079644Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T09:44:59.7108116Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:44:59.7172078Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T09:44:59.7194493Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:44:59.7257006Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T09:44:59.7286363Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:44:59.7346126Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T09:44:59.7368104Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:44:59.7434236Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T09:44:59.7456159Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:44:59.7516831Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T09:44:59.7540535Z Entering 'third_party/flash-attention' 2025-12-04T09:44:59.7604569Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T09:44:59.7626226Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:44:59.7687355Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T09:44:59.7714577Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:44:59.7777373Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T09:44:59.7808252Z Entering 'third_party/flatbuffers' 2025-12-04T09:44:59.7874420Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T09:44:59.7898778Z Entering 'third_party/fmt' 2025-12-04T09:44:59.7964418Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:44:59.7986546Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:44:59.8050031Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T09:44:59.8071567Z Entering 'third_party/gloo' 2025-12-04T09:44:59.8133067Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T09:44:59.8156443Z Entering 'third_party/googletest' 2025-12-04T09:44:59.8222020Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:44:59.8244721Z Entering 'third_party/ideep' 2025-12-04T09:44:59.8311611Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T09:44:59.8332388Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:44:59.8396660Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T09:44:59.8425804Z Entering 'third_party/ittapi' 2025-12-04T09:44:59.8486884Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T09:44:59.8508894Z Entering 'third_party/kineto' 2025-12-04T09:44:59.8572110Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config remote.origin.url 2025-12-04T09:44:59.8593705Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:44:59.8658787Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T09:44:59.8678817Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:44:59.8746077Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T09:44:59.8770257Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:44:59.8835575Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T09:44:59.8857695Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:44:59.8923302Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T09:44:59.8943700Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:44:59.9007272Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T09:44:59.9026930Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:44:59.9096342Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T09:44:59.9120103Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:44:59.9183842Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T09:44:59.9206235Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:44:59.9278041Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:44:59.9299810Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:44:59.9364757Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T09:44:59.9388296Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:44:59.9451846Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T09:44:59.9474650Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:44:59.9538687Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:44:59.9559016Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:44:59.9626578Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:44:59.9648803Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:44:59.9717066Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:44:59.9744457Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:44:59.9812433Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T09:44:59.9837738Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:44:59.9902712Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T09:44:59.9927580Z Entering 'third_party/kleidiai' 2025-12-04T09:44:59.9993306Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T09:45:00.0016509Z Entering 'third_party/mimalloc' 2025-12-04T09:45:00.0085005Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T09:45:00.0106788Z Entering 'third_party/nlohmann' 2025-12-04T09:45:00.0176740Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T09:45:00.0199655Z Entering 'third_party/onnx' 2025-12-04T09:45:00.0264040Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T09:45:00.0298855Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:00.0364400Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:00.0391688Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:00.0454324Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T09:45:00.0476666Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:00.0543101Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:45:00.0565849Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:00.0628321Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:00.0649115Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:00.0713125Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T09:45:00.0732620Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:00.0796315Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T09:45:00.0818952Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:00.0883508Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T09:45:00.0903794Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:00.0967257Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T09:45:00.0987814Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:00.1051792Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T09:45:00.1072212Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:00.1136268Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T09:45:00.1160074Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:00.1224681Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T09:45:00.1256623Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:00.1322318Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T09:45:00.1362980Z Entering 'third_party/pocketfft' 2025-12-04T09:45:00.1425505Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T09:45:00.1447292Z Entering 'third_party/protobuf' 2025-12-04T09:45:00.1512030Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T09:45:00.1536852Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:00.1606655Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T09:45:00.1628290Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:00.1691147Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 
2025-12-04T09:45:00.1714977Z Entering 'third_party/psimd' 2025-12-04T09:45:00.1776900Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T09:45:00.1798620Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:00.1864158Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T09:45:00.1885711Z Entering 'third_party/pybind11' 2025-12-04T09:45:00.1950888Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:00.1971912Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:00.2033619Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T09:45:00.2055993Z Entering 'third_party/sleef' 2025-12-04T09:45:00.2124460Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T09:45:00.2146652Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:00.2207811Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T09:45:00.2226335Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:00.2290148Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T09:45:00.2310702Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:00.2372929Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T09:45:00.2394816Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:00.2457701Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T09:45:00.2478969Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:00.2541564Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T09:45:00.2560715Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:00.2625162Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T09:45:00.3704029Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'git@github.com:' 2025-12-04T09:45:00.4069559Z Entering 'android/libs/fbjni' 2025-12-04T09:45:00.4122228Z Entering 'third_party/FP16' 2025-12-04T09:45:00.4173708Z Entering 'third_party/FXdiv' 2025-12-04T09:45:00.4228456Z Entering 'third_party/NNPACK' 2025-12-04T09:45:00.4282515Z Entering 'third_party/NVTX' 2025-12-04T09:45:00.4335891Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:00.4390125Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:00.4454666Z Entering 'third_party/aiter' 2025-12-04T09:45:00.4510457Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:00.4571528Z Entering 'third_party/benchmark' 2025-12-04T09:45:00.4622041Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:00.4683501Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:00.4737646Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:00.4788643Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:00.4841437Z Entering 'third_party/cutlass' 2025-12-04T09:45:00.4901626Z Entering 'third_party/fbgemm' 2025-12-04T09:45:00.4959827Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:00.5010715Z Entering 
'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:00.5068577Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:00.5118607Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:00.5176591Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:00.5235222Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:00.5288604Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:00.5344699Z Entering 'third_party/flash-attention' 2025-12-04T09:45:00.5398895Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:00.5454897Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:00.5516928Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:00.5570873Z Entering 'third_party/fmt' 2025-12-04T09:45:00.5624007Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:00.5677261Z Entering 'third_party/gloo' 2025-12-04T09:45:00.5728590Z Entering 'third_party/googletest' 2025-12-04T09:45:00.5780861Z Entering 'third_party/ideep' 2025-12-04T09:45:00.5830443Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:00.5890424Z Entering 'third_party/ittapi' 2025-12-04T09:45:00.5947614Z Entering 'third_party/kineto' 2025-12-04T09:45:00.5999299Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:45:00.6049758Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:00.6106857Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:00.6160015Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:00.6211587Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:00.6261236Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:00.6315311Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:00.6372838Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:00.6425382Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:00.6479544Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:00.6530843Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:00.6581190Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:00.6634041Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:00.6693161Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:00.6749347Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:00.6805119Z Entering 'third_party/kleidiai' 2025-12-04T09:45:00.6860936Z Entering 'third_party/mimalloc' 2025-12-04T09:45:00.6912711Z Entering 'third_party/nlohmann' 2025-12-04T09:45:00.6967428Z Entering 'third_party/onnx' 2025-12-04T09:45:00.7033149Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:00.7087017Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:00.7139554Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:00.7191110Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:00.7241344Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:00.7291614Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:00.7343071Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:00.7394862Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:00.7446952Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:00.7500233Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:00.7552345Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:00.7607501Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:00.7684277Z Entering 'third_party/pocketfft' 2025-12-04T09:45:00.7738304Z Entering 'third_party/protobuf' 2025-12-04T09:45:00.7792995Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:00.7847727Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:00.7902141Z Entering 'third_party/psimd' 2025-12-04T09:45:00.7958850Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:00.8011372Z Entering 'third_party/pybind11' 2025-12-04T09:45:00.8065284Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:00.8118521Z Entering 'third_party/sleef' 2025-12-04T09:45:00.8172728Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:00.8224024Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:00.8280093Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:00.8331068Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:00.8381274Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:00.8435022Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:00.8513974Z [command]/usr/bin/git submodule foreach --recursive git config --local --add 'url.https://github.com/.insteadOf' 'org-21003710@github.com:' 2025-12-04T09:45:00.8864768Z Entering 'android/libs/fbjni' 2025-12-04T09:45:00.8918391Z Entering 'third_party/FP16' 2025-12-04T09:45:00.8969818Z Entering 'third_party/FXdiv' 2025-12-04T09:45:00.9020657Z Entering 'third_party/NNPACK' 2025-12-04T09:45:00.9072526Z Entering 'third_party/NVTX' 2025-12-04T09:45:00.9125582Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:00.9178715Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:00.9245457Z Entering 
'third_party/aiter' 2025-12-04T09:45:00.9300230Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:00.9359747Z Entering 'third_party/benchmark' 2025-12-04T09:45:00.9410355Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:00.9469826Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:00.9521211Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:00.9572905Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:00.9625388Z Entering 'third_party/cutlass' 2025-12-04T09:45:00.9685769Z Entering 'third_party/fbgemm' 2025-12-04T09:45:00.9746967Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:00.9798717Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:00.9856964Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:00.9917515Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:00.9979617Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:01.0029504Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:01.0080366Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:01.0134296Z Entering 'third_party/flash-attention' 2025-12-04T09:45:01.0187151Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:01.0245255Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:01.0308524Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:01.0363678Z Entering 'third_party/fmt' 2025-12-04T09:45:01.0417273Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:01.0471116Z Entering 'third_party/gloo' 2025-12-04T09:45:01.0524578Z Entering 'third_party/googletest' 2025-12-04T09:45:01.0578859Z Entering 'third_party/ideep' 2025-12-04T09:45:01.0629117Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:01.0689567Z Entering 'third_party/ittapi' 2025-12-04T09:45:01.0745362Z Entering 'third_party/kineto' 2025-12-04T09:45:01.0796469Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 
2025-12-04T09:45:01.0848760Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:01.0902647Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:01.0956151Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:01.1009284Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:01.1059474Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:01.1114171Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:01.1168062Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:01.1220364Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:01.1273119Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:01.1324721Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:01.1377883Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:01.1432693Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:01.1492271Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:01.1544312Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:01.1600378Z Entering 'third_party/kleidiai' 2025-12-04T09:45:01.1653169Z Entering 'third_party/mimalloc' 2025-12-04T09:45:01.1705535Z Entering 'third_party/nlohmann' 2025-12-04T09:45:01.1760465Z Entering 'third_party/onnx' 2025-12-04T09:45:01.1825509Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:01.1881793Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:01.1932463Z Entering 
'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:01.1982749Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:01.2034147Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:01.2082955Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:01.2132275Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:01.2182308Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:01.2231756Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:01.2280718Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:01.2332050Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:01.2385753Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:01.2457269Z Entering 'third_party/pocketfft' 2025-12-04T09:45:01.2508292Z Entering 'third_party/protobuf' 2025-12-04T09:45:01.2561579Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:01.2610910Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:01.2663504Z Entering 'third_party/psimd' 2025-12-04T09:45:01.2715692Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:01.2772595Z Entering 'third_party/pybind11' 2025-12-04T09:45:01.2825981Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:01.2880982Z Entering 'third_party/sleef' 2025-12-04T09:45:01.2932741Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:01.2981464Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:01.3032017Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:01.3081631Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:01.3131815Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:01.3179585Z Entering 
'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:01.3252801Z ##[endgroup] 2025-12-04T09:45:01.3298150Z [command]/usr/bin/git log -1 --format=%H 2025-12-04T09:45:01.3323277Z ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:45:01.3434601Z ##[group]Run cd "${GITHUB_WORKSPACE}" 2025-12-04T09:45:01.3434967Z cd "${GITHUB_WORKSPACE}" 2025-12-04T09:45:01.3435296Z # Clean stale submodule dirs 2025-12-04T09:45:01.3435629Z if [ -z "${NO_SUDO}" ]; then 2025-12-04T09:45:01.3436021Z  sudo git submodule foreach --recursive git clean -ffdx 2025-12-04T09:45:01.3436554Z else 2025-12-04T09:45:01.3436894Z  git submodule foreach --recursive git clean -ffdx 2025-12-04T09:45:01.3437264Z fi 2025-12-04T09:45:01.3446522Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:01.3446924Z env: 2025-12-04T09:45:01.3447201Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:01.3447510Z NO_SUDO: true 2025-12-04T09:45:01.3447811Z ##[endgroup] 2025-12-04T09:45:01.3831115Z Entering 'android/libs/fbjni' 2025-12-04T09:45:01.3872618Z Entering 'third_party/FP16' 2025-12-04T09:45:01.3911152Z Entering 'third_party/FXdiv' 2025-12-04T09:45:01.3949121Z Entering 'third_party/NNPACK' 2025-12-04T09:45:01.3994406Z Entering 'third_party/NVTX' 2025-12-04T09:45:01.4040537Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T09:45:01.4080705Z Entering 'third_party/XNNPACK' 2025-12-04T09:45:01.4216249Z Entering 'third_party/aiter' 2025-12-04T09:45:01.4269656Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T09:45:01.4399954Z Entering 'third_party/benchmark' 2025-12-04T09:45:01.4441225Z Entering 'third_party/composable_kernel' 2025-12-04T09:45:01.4583951Z Entering 'third_party/cpp-httplib' 2025-12-04T09:45:01.4630252Z Entering 'third_party/cpuinfo' 2025-12-04T09:45:01.4683535Z Entering 'third_party/cudnn_frontend' 2025-12-04T09:45:01.4728866Z Entering 'third_party/cutlass' 2025-12-04T09:45:01.4860786Z Entering 'third_party/fbgemm' 
2025-12-04T09:45:01.4931892Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T09:45:01.4971186Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T09:45:01.5109653Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T09:45:01.5154646Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T09:45:01.5269443Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T09:45:01.5310831Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T09:45:01.5348617Z Entering 'third_party/fbgemm/external/json' 2025-12-04T09:45:01.5406152Z Entering 'third_party/flash-attention' 2025-12-04T09:45:01.5456882Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T09:45:01.5578361Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T09:45:01.5684765Z Entering 'third_party/flatbuffers' 2025-12-04T09:45:01.5776727Z Entering 'third_party/fmt' 2025-12-04T09:45:01.5825829Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T09:45:01.5869022Z Entering 'third_party/gloo' 2025-12-04T09:45:01.5911085Z Entering 'third_party/googletest' 2025-12-04T09:45:01.5957262Z Entering 'third_party/ideep' 2025-12-04T09:45:01.5997084Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T09:45:01.6097747Z Entering 'third_party/ittapi' 2025-12-04T09:45:01.6140369Z Entering 'third_party/kineto' 2025-12-04T09:45:01.6183839Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T09:45:01.6233014Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T09:45:01.6291063Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T09:45:01.6331931Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T09:45:01.6375637Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T09:45:01.6414881Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T09:45:01.6457219Z 
Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T09:45:01.6498576Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T09:45:01.6541806Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T09:45:01.6593328Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T09:45:01.6632497Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T09:45:01.6678214Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:01.6738170Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:01.6788659Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T09:45:01.6829843Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T09:45:01.6875268Z Entering 'third_party/kleidiai' 2025-12-04T09:45:01.6930938Z Entering 'third_party/mimalloc' 2025-12-04T09:45:01.6971497Z Entering 'third_party/nlohmann' 2025-12-04T09:45:01.7031915Z Entering 'third_party/onnx' 2025-12-04T09:45:01.7452822Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T09:45:01.7500568Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T09:45:01.7575178Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T09:45:01.7616695Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T09:45:01.7662914Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T09:45:01.7700186Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T09:45:01.7751313Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T09:45:01.7790695Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T09:45:01.7829984Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T09:45:01.7869244Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T09:45:01.7929101Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T09:45:01.7977030Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T09:45:01.8314826Z Entering 'third_party/pocketfft' 2025-12-04T09:45:01.8356099Z Entering 'third_party/protobuf' 2025-12-04T09:45:01.8451097Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T09:45:01.8489852Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T09:45:01.8536164Z Entering 'third_party/psimd' 2025-12-04T09:45:01.8583035Z Entering 'third_party/pthreadpool' 2025-12-04T09:45:01.8625965Z Entering 'third_party/pybind11' 2025-12-04T09:45:01.8671025Z Entering 'third_party/python-peachpy' 2025-12-04T09:45:01.8711083Z Entering 'third_party/sleef' 2025-12-04T09:45:01.8754274Z Entering 'third_party/tensorpipe' 2025-12-04T09:45:01.8796805Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T09:45:01.8838243Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T09:45:01.8879865Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T09:45:01.8929935Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T09:45:01.8967560Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T09:45:01.9124362Z Prepare all required actions 2025-12-04T09:45:01.9124949Z Getting action download info 2025-12-04T09:45:02.0534944Z ##[group]Run ./.github/actions/setup-linux 2025-12-04T09:45:02.0535181Z env: 2025-12-04T09:45:02.0535349Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:02.0535538Z ##[endgroup] 2025-12-04T09:45:02.0568319Z ##[group]Run set -euo pipefail 2025-12-04T09:45:02.0568562Z set -euo pipefail 2025-12-04T09:45:02.0568774Z function get_ec2_metadata() { 2025-12-04T09:45:02.0569049Z  # Pulled from instance 
metadata endpoint for EC2 2025-12-04T09:45:02.0569493Z  # see https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instancedata-data-retrieval.html 2025-12-04T09:45:02.0569889Z  category=$1 2025-12-04T09:45:02.0570146Z  # If it is GCP runner (runner name contains gcp), do not run this 2025-12-04T09:45:02.0570449Z  runner_name_str=i-07df7d64debf86ede 2025-12-04T09:45:02.0570729Z  if [[ -f /.inarc ]]; then 2025-12-04T09:45:02.0570973Z  echo "ARC Runner, no info on ec2 metadata" 2025-12-04T09:45:02.0571240Z  elif [[ $runner_name_str == *"gcp"* ]]; then 2025-12-04T09:45:02.0571567Z  echo "Runner is from Google Cloud Platform, No info on ec2 metadata" 2025-12-04T09:45:02.0572024Z  else 2025-12-04T09:45:02.0572632Z  curl -H "X-aws-ec2-metadata-token: $(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 30")" -fsSL "http://169.254.169.254/latest/meta-data/${category}" 2025-12-04T09:45:02.0573256Z  fi 2025-12-04T09:45:02.0573414Z } 2025-12-04T09:45:02.0573601Z echo "ami-id: $(get_ec2_metadata ami-id)" 2025-12-04T09:45:02.0573895Z echo "instance-id: $(get_ec2_metadata instance-id)" 2025-12-04T09:45:02.0574225Z echo "instance-type: $(get_ec2_metadata instance-type)" 2025-12-04T09:45:02.0574514Z echo "system info $(uname -a)" 2025-12-04T09:45:02.0582780Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:02.0583069Z env: 2025-12-04T09:45:02.0583233Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:02.0583418Z ##[endgroup] 2025-12-04T09:45:02.0743416Z ami-id: ami-08982f1c5bf93d976 2025-12-04T09:45:02.0845845Z instance-id: i-07df7d64debf86ede 2025-12-04T09:45:02.0952976Z instance-type: g6.4xlarge 2025-12-04T09:45:02.0966877Z system info Linux ip-10-0-6-74.ec2.internal 6.1.150-174.273.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Sep 9 12:21:26 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux 2025-12-04T09:45:02.0987548Z ##[group]Run if [ -f /usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T09:45:02.0987917Z if [ -f 
/usr/bin/nvidia-smi ]; then nvidia-smi; fi 2025-12-04T09:45:02.0996029Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:02.0996301Z env: 2025-12-04T09:45:02.0996467Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:02.0996646Z ##[endgroup] 2025-12-04T09:45:03.5352624Z Thu Dec 4 09:45:03 2025 2025-12-04T09:45:03.5353272Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:45:03.5353942Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:45:03.5354396Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:45:03.5355022Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:45:03.5355849Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:45:03.5356403Z | | | MIG M. | 2025-12-04T09:45:03.5356700Z |=========================================+========================+======================| 2025-12-04T09:45:03.5429937Z | 0 NVIDIA L4 Off | 00000000:35:00.0 Off | 0 | 2025-12-04T09:45:03.5430724Z | N/A 37C P0 29W / 72W | 0MiB / 23034MiB | 4% Default | 2025-12-04T09:45:03.5431121Z | | | N/A | 2025-12-04T09:45:03.5431494Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:45:03.5431777Z 2025-12-04T09:45:03.5431940Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:45:03.5432324Z | Processes: | 2025-12-04T09:45:03.5432728Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:45:03.5433121Z | ID ID Usage | 2025-12-04T09:45:03.5433420Z |=========================================================================================| 2025-12-04T09:45:03.5434743Z | No running processes found | 2025-12-04T09:45:03.5435237Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:45:03.8719145Z ##[group]Run echo 
"IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:45:03.8720006Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:45:03.8731848Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:03.8732133Z env: 2025-12-04T09:45:03.8732294Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:03.8732484Z ##[endgroup] 2025-12-04T09:45:03.8790764Z ##[group]Run if systemctl is-active --quiet docker; then 2025-12-04T09:45:03.8791097Z if systemctl is-active --quiet docker; then 2025-12-04T09:45:03.8791370Z  echo "Docker daemon is running..."; 2025-12-04T09:45:03.8791611Z else 2025-12-04T09:45:03.8791870Z  echo "Starting docker daemon..." && sudo systemctl start docker; 2025-12-04T09:45:03.8792180Z fi 2025-12-04T09:45:03.8800008Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:03.8800283Z env: 2025-12-04T09:45:03.8800453Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:03.8800632Z ##[endgroup] 2025-12-04T09:45:03.8897122Z Docker daemon is running... 2025-12-04T09:45:03.8938529Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:45:03.8938745Z with: 2025-12-04T09:45:03.8938891Z shell: bash 2025-12-04T09:45:03.8939045Z timeout_minutes: 5 2025-12-04T09:45:03.8939227Z max_attempts: 3 2025-12-04T09:45:03.8939394Z retry_wait_seconds: 30 2025-12-04T09:45:03.8941027Z command: AWS_ACCOUNT_ID=$(aws sts get-caller-identity|grep Account|cut -f4 -d\") aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" # For LF Runners we need to make sure we also login to Meta's ECR docker registry too. 
META_AWS_ACCOUNT_ID=308535385114 if [ "$AWS_ACCOUNT_ID" != "$META_AWS_ACCOUNT_ID" ] ; then aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \ --password-stdin "$META_AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com" fi 2025-12-04T09:45:03.8942641Z polling_interval_seconds: 1 2025-12-04T09:45:03.8942846Z warning_on_retry: true 2025-12-04T09:45:03.8943031Z continue_on_error: false 2025-12-04T09:45:03.8943215Z env: 2025-12-04T09:45:03.8943369Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:03.8943555Z AWS_RETRY_MODE: standard 2025-12-04T09:45:03.8943730Z AWS_MAX_ATTEMPTS: 5 2025-12-04T09:45:03.8943913Z AWS_DEFAULT_REGION: us-east-1 2025-12-04T09:45:03.8944120Z ##[endgroup] 2025-12-04T09:45:04.9404387Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:45:04.9405318Z Configure a credential helper to remove this warning. See 2025-12-04T09:45:04.9405870Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:45:04.9406236Z 2025-12-04T09:45:04.9406332Z Login Succeeded 2025-12-04T09:45:04.9695580Z Command completed after 1 attempt(s). 
2025-12-04T09:45:04.9755679Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:45:04.9756079Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:45:04.9756398Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:45:04.9764989Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:04.9765255Z env: 2025-12-04T09:45:04.9765423Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:04.9765620Z ##[endgroup] 2025-12-04T09:45:04.9981571Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:45:04.9981985Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T09:45:04.9982300Z # shellcheck disable=SC2046 2025-12-04T09:45:04.9982546Z docker stop $(docker ps -q) || true 2025-12-04T09:45:04.9982785Z # Prune all of the docker images 2025-12-04T09:45:04.9983020Z docker system prune -af 2025-12-04T09:45:04.9990201Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:04.9990655Z env: 2025-12-04T09:45:04.9990819Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:04.9991004Z ##[endgroup] 2025-12-04T09:45:05.0286179Z "docker stop" requires at least 1 argument. 2025-12-04T09:45:05.0286568Z See 'docker stop --help'. 2025-12-04T09:45:05.0286736Z 2025-12-04T09:45:05.0286888Z Usage: docker stop [OPTIONS] CONTAINER [CONTAINER...] 
2025-12-04T09:45:05.0287128Z 2025-12-04T09:45:05.0287233Z Stop one or more running containers 2025-12-04T09:45:05.0477301Z Total reclaimed space: 0B 2025-12-04T09:45:05.0626010Z ##[group]Run pytorch/test-infra/.github/actions/calculate-docker-image@main 2025-12-04T09:45:05.0626398Z with: 2025-12-04T09:45:05.0626991Z docker-image-name: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0627702Z use-custom-docker-registry: true 2025-12-04T09:45:05.0627935Z docker-build-dir: .ci/docker 2025-12-04T09:45:05.0628167Z docker-build-script: ./build.sh 2025-12-04T09:45:05.0628382Z working-directory: . 2025-12-04T09:45:05.0628638Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.0628922Z force-push: false 2025-12-04T09:45:05.0629091Z env: 2025-12-04T09:45:05.0629241Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:05.0629420Z ##[endgroup] 2025-12-04T09:45:05.0645829Z ##[group]Run set -ex 2025-12-04T09:45:05.0646046Z set -ex 2025-12-04T09:45:05.0646205Z  2025-12-04T09:45:05.0646513Z # If the docker build directory or the build script doesn't exist, the action will 2025-12-04T09:45:05.0646989Z # gracefully return the docker image name as it is. Pulling docker image in Linux 2025-12-04T09:45:05.0647391Z # job could then download the pre-built image as usual 2025-12-04T09:45:05.0647877Z if [[ -d "${DOCKER_BUILD_DIR}" ]] && [[ -f "${DOCKER_BUILD_DIR}/${DOCKER_BUILD_SCRIPT}" ]] && [[ "${USE_CUSTOM_DOCKER_REGISTRY}" == "true" ]]; then 2025-12-04T09:45:05.0648324Z  echo "skip=false" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0648558Z else 2025-12-04T09:45:05.0648745Z  echo "skip=true" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0649058Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0649333Z  2025-12-04T09:45:05.0649726Z  echo "Not using custom ECR registry. 
Either it was not requested or there is no Docker build script in the ${REPO_NAME} repo..." 2025-12-04T09:45:05.0650187Z  exit 0 2025-12-04T09:45:05.0650351Z fi 2025-12-04T09:45:05.0650501Z  2025-12-04T09:45:05.0650742Z if [[ "${DOCKER_IMAGE_NAME}" == *"${DOCKER_REGISTRY}/${REPO_NAME}"* ]]; then 2025-12-04T09:45:05.0651167Z  # The docker image name already includes the ECR prefix and tag, so we can just 2025-12-04T09:45:05.0651537Z  # use it as it is, but first let's extract the tag 2025-12-04T09:45:05.0651895Z  DOCKER_TAG=$(echo "${DOCKER_IMAGE_NAME}" | awk -F '[:,]' '{print $2}') 2025-12-04T09:45:05.0652270Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0652611Z  echo "docker-image=${DOCKER_IMAGE_NAME}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0652891Z else 2025-12-04T09:45:05.0653076Z  if [[ "${DOCKER_IMAGE_NAME}" == *:* ]]; then 2025-12-04T09:45:05.0653349Z  CUSTOM_TAG_PREFIX=${DOCKER_IMAGE_NAME#*:} 2025-12-04T09:45:05.0653622Z  DOCKER_IMAGE_NAME=${DOCKER_IMAGE_NAME%%:*} 2025-12-04T09:45:05.0653854Z  fi 2025-12-04T09:45:05.0654171Z  DOCKER_TAG=${CUSTOM_TAG_PREFIX:+${CUSTOM_TAG_PREFIX}-}$(git rev-parse HEAD:"${DOCKER_BUILD_DIR}") 2025-12-04T09:45:05.0654611Z  echo "docker-tag=${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0655067Z  echo "docker-image=${DOCKER_REGISTRY}/${REPO_NAME}/${DOCKER_IMAGE_NAME}:${DOCKER_TAG}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0656027Z  echo "custom-tag-prefix=${CUSTOM_TAG_PREFIX}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0656339Z fi 2025-12-04T09:45:05.0664180Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:05.0664463Z env: 2025-12-04T09:45:05.0664622Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:05.0664818Z REPO_NAME: pytorch 2025-12-04T09:45:05.0665582Z DOCKER_IMAGE_NAME: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0666225Z 
DOCKER_BUILD_DIR: .ci/docker 2025-12-04T09:45:05.0666427Z DOCKER_BUILD_SCRIPT: ./build.sh 2025-12-04T09:45:05.0666705Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.0666993Z USE_CUSTOM_DOCKER_REGISTRY: true 2025-12-04T09:45:05.0667198Z CUSTOM_TAG_PREFIX: 2025-12-04T09:45:05.0667460Z ##[endgroup] 2025-12-04T09:45:05.0695031Z + [[ -d .ci/docker ]] 2025-12-04T09:45:05.0695338Z + [[ -f .ci/docker/./build.sh ]] 2025-12-04T09:45:05.0695597Z + [[ true == \t\r\u\e ]] 2025-12-04T09:45:05.0695833Z + echo skip=false 2025-12-04T09:45:05.0696712Z + [[ 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a == *\3\0\8\5\3\5\3\8\5\1\1\4\.\d\k\r\.\e\c\r\.\u\s\-\e\a\s\t\-\1\.\a\m\a\z\o\n\a\w\s\.\c\o\m\/\p\y\t\o\r\c\h* ]] 2025-12-04T09:45:05.0702602Z ++ echo 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0703333Z ++ awk -F '[:,]' '{print $2}' 2025-12-04T09:45:05.0728434Z + DOCKER_TAG=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0729733Z + echo docker-tag=pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0731974Z + echo docker-image=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0760234Z ##[group]Run set +e 2025-12-04T09:45:05.0760448Z set +e 2025-12-04T09:45:05.0760611Z set -x 2025-12-04T09:45:05.0760769Z  2025-12-04T09:45:05.0760915Z login() { 2025-12-04T09:45:05.0761258Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T09:45:05.0761642Z } 2025-12-04T09:45:05.0761800Z  2025-12-04T09:45:05.0762191Z retry () { 2025-12-04T09:45:05.0762408Z  $* || 
(sleep 1 && $*) || (sleep 2 && $*) 2025-12-04T09:45:05.0762629Z } 2025-12-04T09:45:05.0762771Z  2025-12-04T09:45:05.0762934Z retry login "${DOCKER_REGISTRY}" 2025-12-04T09:45:05.0763151Z  2025-12-04T09:45:05.0763307Z START_TIME=$(date +%s) 2025-12-04T09:45:05.0763509Z # Wait up to 120 minutes 2025-12-04T09:45:05.0763777Z while [[ $(( $(date +%s) - 7200 )) -lt $START_TIME ]]; do 2025-12-04T09:45:05.0764124Z  # Check if image already exists, if it does then skip building it 2025-12-04T09:45:05.0764468Z  if docker manifest inspect "${DOCKER_IMAGE}"; then 2025-12-04T09:45:05.0764721Z  exit 0 2025-12-04T09:45:05.0764884Z  fi 2025-12-04T09:45:05.0765024Z  2025-12-04T09:45:05.0765294Z  # NB: This flag is used by Docker build workflow to push the image to ECR, so we can 2025-12-04T09:45:05.0765752Z  # use this to differentiate between the Docker build and regular build jobs. For the 2025-12-04T09:45:05.0766213Z  # latter, it will wait for the Docker images to become available before continuing 2025-12-04T09:45:05.0766570Z  if [ "${DOCKER_PUSH:-false}" == "true" ]; then 2025-12-04T09:45:05.0766859Z  # It's a Docker build job, let's build the image 2025-12-04T09:45:05.0767260Z  break 2025-12-04T09:45:05.0767432Z  else 2025-12-04T09:45:05.0767668Z  # It's a regular build job, wait for the image to become available 2025-12-04T09:45:05.0767957Z  sleep 300 2025-12-04T09:45:05.0768126Z  fi 2025-12-04T09:45:05.0768271Z done 2025-12-04T09:45:05.0768415Z  2025-12-04T09:45:05.0768672Z # NB: This part requires a full checkout. Otherwise, the merge base will 2025-12-04T09:45:05.0769206Z # be empty. 
The default action would be to continue rebuild the image 2025-12-04T09:45:05.0769579Z if [[ "$BASE_REVISION" = "$(git rev-parse HEAD)" ]]; then 2025-12-04T09:45:05.0769908Z  # if we're on the base branch then use the parent commit 2025-12-04T09:45:05.0770190Z  MERGE_BASE=$(git rev-parse HEAD~) 2025-12-04T09:45:05.0770401Z else 2025-12-04T09:45:05.0770630Z  # otherwise we're on a PR, so use the most recent base commit 2025-12-04T09:45:05.0770973Z  MERGE_BASE=$(git merge-base HEAD "$BASE_REVISION") 2025-12-04T09:45:05.0771224Z fi 2025-12-04T09:45:05.0771371Z  2025-12-04T09:45:05.0771535Z if [[ -z "${MERGE_BASE}" ]]; then 2025-12-04T09:45:05.0771780Z  echo "rebuild=true" >> "${GITHUB_OUTPUT}" 2025-12-04T09:45:05.0772003Z  2025-12-04T09:45:05.0772329Z  echo "Finding merge base only works with full checkout, please set fetch-depth to 0, continuing ..." 2025-12-04T09:45:05.0772715Z  exit 0 2025-12-04T09:45:05.0772865Z fi 2025-12-04T09:45:05.0773009Z  2025-12-04T09:45:05.0773220Z if ! git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}"; then 2025-12-04T09:45:05.0773691Z  echo "Directory '${DOCKER_BUILD_DIR}' not found in commit $MERGE_BASE, you should rebase onto a more recent commit" 2025-12-04T09:45:05.0774090Z  exit 1 2025-12-04T09:45:05.0774241Z fi 2025-12-04T09:45:05.0774389Z  2025-12-04T09:45:05.0774628Z PREVIOUS_DOCKER_TAG=$(git rev-parse "${MERGE_BASE}:${DOCKER_BUILD_DIR}") 2025-12-04T09:45:05.0775081Z # If no image exists but the hash is the same as the previous hash then we should error out here 2025-12-04T09:45:05.0775502Z if [[ "${PREVIOUS_DOCKER_TAG}" == "${DOCKER_TAG}" ]]; then 2025-12-04T09:45:05.0775965Z  echo "WARNING: Something has gone wrong and the previous image isn't available for the merge-base of your branch" 2025-12-04T09:45:05.0776484Z  echo " Will re-build docker image to store in local cache, TTS may be longer" 2025-12-04T09:45:05.0776795Z fi 2025-12-04T09:45:05.0776938Z  2025-12-04T09:45:05.0777107Z echo "rebuild=true" >> "${GITHUB_OUTPUT}" 
2025-12-04T09:45:05.0784636Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:05.0784921Z env: 2025-12-04T09:45:05.0785080Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:05.0785278Z DOCKER_BUILD_DIR: .ci/docker 2025-12-04T09:45:05.0785535Z BASE_REVISION: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:45:05.0786201Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0787004Z DOCKER_TAG: pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.0787598Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.0787878Z DOCKER_PUSH: 2025-12-04T09:45:05.0788044Z ##[endgroup] 2025-12-04T09:45:05.0815045Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.0815406Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.0817873Z + aws ecr get-login-password --region us-east-1 2025-12-04T09:45:05.0819232Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:05.5507084Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:45:05.5507963Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:45:05.5508386Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:45:05.5508678Z 2025-12-04T09:45:05.5509679Z Login Succeeded 2025-12-04T09:45:05.5528961Z ++ date +%s 2025-12-04T09:45:05.5541280Z + START_TIME=1764841505 2025-12-04T09:45:05.5544768Z ++ date +%s 2025-12-04T09:45:05.5557349Z + [[ 1764834305 -lt 1764841505 ]] 2025-12-04T09:45:05.5558225Z + docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:05.7764379Z { 2025-12-04T09:45:05.7764703Z "schemaVersion": 2, 2025-12-04T09:45:05.7765201Z "mediaType": "application/vnd.docker.distribution.manifest.v2+json", 2025-12-04T09:45:05.7765531Z "config": { 2025-12-04T09:45:05.7765790Z "mediaType": "application/vnd.docker.container.image.v1+json", 2025-12-04T09:45:05.7766116Z "size": 34864, 2025-12-04T09:45:05.7766541Z "digest": "sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301" 2025-12-04T09:45:05.7767062Z }, 2025-12-04T09:45:05.7767214Z "layers": [ 2025-12-04T09:45:05.7767361Z { 2025-12-04T09:45:05.7767743Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7768238Z "size": 30447951, 2025-12-04T09:45:05.7768758Z "digest": "sha256:63e5bc7682b85ae57a1221210f64d62e7a90b0a30f19af4ca734b8242ae49d63" 2025-12-04T09:45:05.7769163Z }, 2025-12-04T09:45:05.7769306Z { 2025-12-04T09:45:05.7769540Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7769936Z "size": 1554, 2025-12-04T09:45:05.7770250Z "digest": "sha256:0678d56345c994444b77bb70b1177189d23e794748b1d75ffc45d227c7dea94a" 2025-12-04T09:45:05.7770566Z }, 2025-12-04T09:45:05.7770687Z { 2025-12-04T09:45:05.7770906Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7771198Z "size": 313275661, 2025-12-04T09:45:05.7771509Z "digest": 
"sha256:45f5c9ddfce78349dff3d5edfbaa0310ae17311f66abdcd7e00fa21b500e801c" 2025-12-04T09:45:05.7771848Z }, 2025-12-04T09:45:05.7771975Z { 2025-12-04T09:45:05.7772198Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7772485Z "size": 787, 2025-12-04T09:45:05.7772768Z "digest": "sha256:086b1df51ac1162d9c45698e9dfaf91c6c222c8bd9ab01797ac8f9344bc8044f" 2025-12-04T09:45:05.7773096Z }, 2025-12-04T09:45:05.7773219Z { 2025-12-04T09:45:05.7773451Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7773738Z "size": 106, 2025-12-04T09:45:05.7774023Z "digest": "sha256:fe8a7b64bf98352f89057bcba66beef2fb44cc05fbd3606abccd8e86cf476234" 2025-12-04T09:45:05.7774351Z }, 2025-12-04T09:45:05.7774490Z { 2025-12-04T09:45:05.7774711Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7775011Z "size": 703, 2025-12-04T09:45:05.7775290Z "digest": "sha256:7680723e9a578033dd106b45784c639f06cc8adb1f5239ec513d9de01087c1af" 2025-12-04T09:45:05.7775614Z }, 2025-12-04T09:45:05.7775738Z { 2025-12-04T09:45:05.7775960Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7776264Z "size": 1216, 2025-12-04T09:45:05.7776554Z "digest": "sha256:9c5027aeeb4e3101f48c1d2e400c387110e1009e42497ee801f1b4b7f7edb5c0" 2025-12-04T09:45:05.7776882Z }, 2025-12-04T09:45:05.7777018Z { 2025-12-04T09:45:05.7777382Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7777669Z "size": 483, 2025-12-04T09:45:05.7777929Z "digest": "sha256:9a56521103600bd37a1e7c1191b5136c2d738c092f8a6701499f7068a32c2628" 2025-12-04T09:45:05.7778252Z }, 2025-12-04T09:45:05.7778384Z { 2025-12-04T09:45:05.7778598Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7778882Z "size": 110361875, 2025-12-04T09:45:05.7780130Z "digest": "sha256:375c4427e9141269458333b1463fdb219e736fd6231ec1c56c625c48437ace77" 2025-12-04T09:45:05.7780441Z }, 
2025-12-04T09:45:05.7780561Z { 2025-12-04T09:45:05.7780779Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7781054Z "size": 4961, 2025-12-04T09:45:05.7781327Z "digest": "sha256:a86faaa7dbdd70e678e5ea20072637ee42618921ca8f80ca089f789325d4b0c2" 2025-12-04T09:45:05.7781646Z }, 2025-12-04T09:45:05.7781781Z { 2025-12-04T09:45:05.7782132Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7782413Z "size": 1755, 2025-12-04T09:45:05.7782684Z "digest": "sha256:fb7848686804957915d98f8655ef6da0fe4c521b50a82aefdebf475983505a15" 2025-12-04T09:45:05.7782989Z }, 2025-12-04T09:45:05.7783114Z { 2025-12-04T09:45:05.7783343Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7783614Z "size": 724, 2025-12-04T09:45:05.7783882Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:45:05.7784192Z }, 2025-12-04T09:45:05.7784327Z { 2025-12-04T09:45:05.7784539Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7784814Z "size": 543, 2025-12-04T09:45:05.7785085Z "digest": "sha256:79dc80f426b29d4ae9157b967050b03e66aa0c4b1295b944a1dd70106be87066" 2025-12-04T09:45:05.7785392Z }, 2025-12-04T09:45:05.7785519Z { 2025-12-04T09:45:05.7785738Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7786014Z "size": 3185190117, 2025-12-04T09:45:05.7786321Z "digest": "sha256:a13fcc1b90bb9c251ebe7ef2a03c4cb3afa1c8bdafe84f5f85136773059a3735" 2025-12-04T09:45:05.7786656Z }, 2025-12-04T09:45:05.7786788Z { 2025-12-04T09:45:05.7787015Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7787424Z "size": 32, 2025-12-04T09:45:05.7787712Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7788033Z }, 2025-12-04T09:45:05.7788161Z { 2025-12-04T09:45:05.7788385Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7788659Z "size": 396, 2025-12-04T09:45:05.7788942Z "digest": "sha256:549db4d6c618ecd9534658a233e3c90508f82d8735f965c2786b2eaa078869e5" 2025-12-04T09:45:05.7789260Z }, 2025-12-04T09:45:05.7789380Z { 2025-12-04T09:45:05.7789597Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7789883Z "size": 236860, 2025-12-04T09:45:05.7790166Z "digest": "sha256:5c63528cb580001e65104f4cb0809bf0673a00f989a7db42fd6d86aa1ec27cee" 2025-12-04T09:45:05.7790706Z }, 2025-12-04T09:45:05.7790840Z { 2025-12-04T09:45:05.7791096Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7791513Z "size": 231, 2025-12-04T09:45:05.7791799Z "digest": "sha256:75bd83b989a44e4d4119a3f972891025eb0e9ce95cfbe4a0ca5cdbe7130028d6" 2025-12-04T09:45:05.7792126Z }, 2025-12-04T09:45:05.7792248Z { 2025-12-04T09:45:05.7792462Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7792744Z "size": 3043497, 2025-12-04T09:45:05.7793021Z "digest": "sha256:de6e78970f517178cb91f36cd02bd9ca7b72a08fb82a0f9007516026f258c035" 2025-12-04T09:45:05.7793348Z }, 2025-12-04T09:45:05.7793474Z { 2025-12-04T09:45:05.7793683Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7793966Z "size": 1472, 2025-12-04T09:45:05.7794257Z "digest": "sha256:e13ed7c7e4736e81dc21af755b3363eb26e4d3b2f1ca988dfe65effa47d8fa42" 2025-12-04T09:45:05.7794580Z }, 2025-12-04T09:45:05.7794705Z { 2025-12-04T09:45:05.7794922Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7795195Z "size": 481, 2025-12-04T09:45:05.7795482Z "digest": "sha256:6e2949bcb74152577a0f20c38bcb6dd80f5e68427e3e531a80e08c9ecc73a979" 2025-12-04T09:45:05.7795803Z }, 2025-12-04T09:45:05.7796040Z { 2025-12-04T09:45:05.7796251Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7796530Z "size": 202, 
2025-12-04T09:45:05.7796806Z "digest": "sha256:14d69d9aaec70287efd2fd35c4f93e43a29a4098458cc9fca1c93f02ad7356cb" 2025-12-04T09:45:05.7797121Z }, 2025-12-04T09:45:05.7797246Z { 2025-12-04T09:45:05.7797463Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7797734Z "size": 607, 2025-12-04T09:45:05.7798163Z "digest": "sha256:5c02769dd8e5bba2f7f5fd84bde9595fcb3bdbffcae497503fa846f9b5e78bf5" 2025-12-04T09:45:05.7798498Z }, 2025-12-04T09:45:05.7798618Z { 2025-12-04T09:45:05.7798834Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7799115Z "size": 7889619584, 2025-12-04T09:45:05.7799402Z "digest": "sha256:35041ce524ac4afec40ecd73b1393c830614f1f79d43a6439767a6c7d5b7027b" 2025-12-04T09:45:05.7799724Z }, 2025-12-04T09:45:05.7799850Z { 2025-12-04T09:45:05.7800067Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7800337Z "size": 830, 2025-12-04T09:45:05.7800612Z "digest": "sha256:2fa92dc5885e080e049ceb4139288b6c0e39fab34256945708b08ea55a1f7a0b" 2025-12-04T09:45:05.7800923Z }, 2025-12-04T09:45:05.7801045Z { 2025-12-04T09:45:05.7801266Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7801557Z "size": 33451739, 2025-12-04T09:45:05.7801845Z "digest": "sha256:2b85eafbd92a0e70a0a70154ad8bf4584095e576d95873368f30373f5966714a" 2025-12-04T09:45:05.7802170Z }, 2025-12-04T09:45:05.7802299Z { 2025-12-04T09:45:05.7802510Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7802791Z "size": 104, 2025-12-04T09:45:05.7803071Z "digest": "sha256:ff755a4ddad7880f23c6b767d432d6f1eafdb62b3ea18f8a98e22c441c099fcb" 2025-12-04T09:45:05.7803393Z }, 2025-12-04T09:45:05.7803527Z { 2025-12-04T09:45:05.7803748Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7804030Z "size": 1496, 2025-12-04T09:45:05.7804292Z "digest": 
"sha256:09eb41bdf42d8605b57b2363348154140904dec914b34a67298b82122bfce2b3" 2025-12-04T09:45:05.7804600Z }, 2025-12-04T09:45:05.7804728Z { 2025-12-04T09:45:05.7804933Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7805213Z "size": 458787828, 2025-12-04T09:45:05.7805493Z "digest": "sha256:11ede4d59e935e62f41b33220fe871794ab5e57ce724173b713368977683bcf6" 2025-12-04T09:45:05.7805805Z }, 2025-12-04T09:45:05.7805930Z { 2025-12-04T09:45:05.7806146Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7806430Z "size": 164, 2025-12-04T09:45:05.7806704Z "digest": "sha256:1283cd8f801a142172f3ab76fd472df8583223d9437de3e4d18d8cf98ea3fa98" 2025-12-04T09:45:05.7807015Z }, 2025-12-04T09:45:05.7807142Z { 2025-12-04T09:45:05.7807348Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7807626Z "size": 346, 2025-12-04T09:45:05.7807894Z "digest": "sha256:024fa855425fa524ad4500660cf61d53be62b99556d31b8b280d14caba434a35" 2025-12-04T09:45:05.7808203Z }, 2025-12-04T09:45:05.7808344Z { 2025-12-04T09:45:05.7808565Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7808838Z "size": 32, 2025-12-04T09:45:05.7809115Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7809436Z }, 2025-12-04T09:45:05.7809557Z { 2025-12-04T09:45:05.7809773Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7810049Z "size": 106, 2025-12-04T09:45:05.7810322Z "digest": "sha256:303e6747a62efecf5efa1f97d0e66b40a3b39da8d79a51f75b89f4c92ae7ec52" 2025-12-04T09:45:05.7810648Z }, 2025-12-04T09:45:05.7810776Z { 2025-12-04T09:45:05.7810990Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7811262Z "size": 424, 2025-12-04T09:45:05.7811631Z "digest": "sha256:3017cdf4838bcc9a33daebc07487f8ae1f6bd6e7ce8322c14f5480e8db9ef90e" 2025-12-04T09:45:05.7811962Z }, 
2025-12-04T09:45:05.7812086Z { 2025-12-04T09:45:05.7812301Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7812580Z "size": 19309374, 2025-12-04T09:45:05.7812864Z "digest": "sha256:6b6cd1c358e886dc6ed7fd46ac4bcc1a0a73b7b1301739ea1953478ee5d83f50" 2025-12-04T09:45:05.7813187Z }, 2025-12-04T09:45:05.7813315Z { 2025-12-04T09:45:05.7813607Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7813892Z "size": 108, 2025-12-04T09:45:05.7814168Z "digest": "sha256:b2dd045011241d1cf8889e2a7369d9fe4844dfe15529b520ccd6a59bd3c1532e" 2025-12-04T09:45:05.7814483Z }, 2025-12-04T09:45:05.7814604Z { 2025-12-04T09:45:05.7814827Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7815110Z "size": 827, 2025-12-04T09:45:05.7815374Z "digest": "sha256:55adc51fe5897031d4cf2f2b8fd162213f6e46a52848630c616606271b97952e" 2025-12-04T09:45:05.7815695Z }, 2025-12-04T09:45:05.7815826Z { 2025-12-04T09:45:05.7816043Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7816326Z "size": 724, 2025-12-04T09:45:05.7816589Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:45:05.7816892Z }, 2025-12-04T09:45:05.7817016Z { 2025-12-04T09:45:05.7817232Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7817506Z "size": 149, 2025-12-04T09:45:05.7817766Z "digest": "sha256:a43ca0e4b837964b12b7469194cfe939c26de027298040028975324dce25938a" 2025-12-04T09:45:05.7818072Z }, 2025-12-04T09:45:05.7818195Z { 2025-12-04T09:45:05.7818401Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7818676Z "size": 138, 2025-12-04T09:45:05.7818948Z "digest": "sha256:b7212f17fd1404837fcfdd086dd0e2667931e4db377d45d8d89a44390c84e11d" 2025-12-04T09:45:05.7819270Z }, 2025-12-04T09:45:05.7819398Z { 2025-12-04T09:45:05.7819608Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7819881Z "size": 141, 2025-12-04T09:45:05.7820152Z "digest": "sha256:083e42cac090e6486c35f392b64ee54448f5e4aa947003aeb3e1f92c8ea5c099" 2025-12-04T09:45:05.7820469Z }, 2025-12-04T09:45:05.7820589Z { 2025-12-04T09:45:05.7820803Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7821081Z "size": 32, 2025-12-04T09:45:05.7821361Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7821678Z }, 2025-12-04T09:45:05.7821803Z { 2025-12-04T09:45:05.7822016Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7822286Z "size": 223, 2025-12-04T09:45:05.7822575Z "digest": "sha256:0a00b784a4aac341795729b254f7edd09e811b7f51d0c58e0e6bfeeee6940503" 2025-12-04T09:45:05.7822899Z }, 2025-12-04T09:45:05.7823026Z { 2025-12-04T09:45:05.7823242Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7823519Z "size": 255, 2025-12-04T09:45:05.7823782Z "digest": "sha256:c6173c779f7ba143a21214ea5f032b141863a37ceb4c0ac01d3248c216ce5241" 2025-12-04T09:45:05.7824101Z }, 2025-12-04T09:45:05.7824246Z { 2025-12-04T09:45:05.7824458Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7824736Z "size": 145520672, 2025-12-04T09:45:05.7825027Z "digest": "sha256:ed3d1e3387b924585c332bf1bc252fa159cd0d25256a874043ff0141b1ab5ff7" 2025-12-04T09:45:05.7825344Z }, 2025-12-04T09:45:05.7825465Z { 2025-12-04T09:45:05.7825693Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7825974Z "size": 106, 2025-12-04T09:45:05.7826235Z "digest": "sha256:b29343478586aeee19d2a622661716f6f1591280c890f49b727a8da13a610784" 2025-12-04T09:45:05.7826547Z }, 2025-12-04T09:45:05.7826775Z { 2025-12-04T09:45:05.7826988Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7827361Z "size": 312293530, 
2025-12-04T09:45:05.7827647Z "digest": "sha256:c6f0520487fb506bc4601fd84d5f28d8a76b203e004731e4b2067c2ab1a14e0b" 2025-12-04T09:45:05.7827958Z }, 2025-12-04T09:45:05.7828082Z { 2025-12-04T09:45:05.7828296Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7828581Z "size": 3058011133, 2025-12-04T09:45:05.7828958Z "digest": "sha256:148171691cd4c4d20310d490d4b4dd903490d04ea07fb8f7e668a28768683e9a" 2025-12-04T09:45:05.7829289Z }, 2025-12-04T09:45:05.7829414Z { 2025-12-04T09:45:05.7829623Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7829898Z "size": 129, 2025-12-04T09:45:05.7830172Z "digest": "sha256:2c666d30ed77fff9ff1167d41cd645dad98280fcbe941f5bc3828c7ae66b1287" 2025-12-04T09:45:05.7830485Z }, 2025-12-04T09:45:05.7830620Z { 2025-12-04T09:45:05.7830848Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7831124Z "size": 880, 2025-12-04T09:45:05.7831396Z "digest": "sha256:5d8d3a0a98e012c5068e0f3bae5a03e3148ecf2d063634eee4c9241a1e3fdfb5" 2025-12-04T09:45:05.7831709Z }, 2025-12-04T09:45:05.7832035Z { 2025-12-04T09:45:05.7832372Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7832736Z "size": 724, 2025-12-04T09:45:05.7833099Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:45:05.7833535Z }, 2025-12-04T09:45:05.7833764Z { 2025-12-04T09:45:05.7834069Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7834481Z "size": 139, 2025-12-04T09:45:05.7834850Z "digest": "sha256:b06bafce9e817295d8127207747c80aa18e04392ff0875844fc30a1e794a8a0c" 2025-12-04T09:45:05.7835330Z }, 2025-12-04T09:45:05.7863596Z { 2025-12-04T09:45:05.7863873Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7864185Z "size": 32, 2025-12-04T09:45:05.7864521Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 
2025-12-04T09:45:05.7864874Z }, 2025-12-04T09:45:05.7864998Z { 2025-12-04T09:45:05.7865227Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7865510Z "size": 159, 2025-12-04T09:45:05.7865812Z "digest": "sha256:15e0d7e4590d3d8f598d05aec3a92f891bf8b4605bcc38cc2de852b6014ef8f3" 2025-12-04T09:45:05.7866141Z }, 2025-12-04T09:45:05.7866266Z { 2025-12-04T09:45:05.7866500Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7866789Z "size": 1011, 2025-12-04T09:45:05.7867073Z "digest": "sha256:a514bd1add3164d8d7ca99aa19294c4ed8b97b074635d98714c4f598a959f4cd" 2025-12-04T09:45:05.7867452Z }, 2025-12-04T09:45:05.7867577Z { 2025-12-04T09:45:05.7867809Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7868089Z "size": 724, 2025-12-04T09:45:05.7868364Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:45:05.7868694Z }, 2025-12-04T09:45:05.7868818Z { 2025-12-04T09:45:05.7869049Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7869324Z "size": 134, 2025-12-04T09:45:05.7869591Z "digest": "sha256:57b84ee6000204f27a1d9bca199b19be4c86ecd324540dbdf239c56a6c3b34ea" 2025-12-04T09:45:05.7869897Z }, 2025-12-04T09:45:05.7870016Z { 2025-12-04T09:45:05.7870223Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7870491Z "size": 32, 2025-12-04T09:45:05.7870772Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7871083Z }, 2025-12-04T09:45:05.7871199Z { 2025-12-04T09:45:05.7871407Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7871676Z "size": 157, 2025-12-04T09:45:05.7872142Z "digest": "sha256:b8babeff6d817a5961dddc15c6bdfdbd05da187fae75d5804015f99fd7c066d8" 2025-12-04T09:45:05.7872466Z }, 2025-12-04T09:45:05.7872593Z { 2025-12-04T09:45:05.7872819Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7873098Z "size": 602, 2025-12-04T09:45:05.7873377Z "digest": "sha256:83779ddf6a85ab387f64a45f274cba245b69e4fd1931ff0b5d7d3efd4b7a43bc" 2025-12-04T09:45:05.7873694Z }, 2025-12-04T09:45:05.7873826Z { 2025-12-04T09:45:05.7874198Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7874488Z "size": 724, 2025-12-04T09:45:05.7874766Z "digest": "sha256:3541df015cdb7e8925273399d28e56c31b3c9196f00439ac2925537b173b1f84" 2025-12-04T09:45:05.7875082Z }, 2025-12-04T09:45:05.7875227Z { 2025-12-04T09:45:05.7875448Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7875733Z "size": 155, 2025-12-04T09:45:05.7876009Z "digest": "sha256:8b7620c0d736cc79381207ce5afe2af90f0cd7f0cd394577d2c9520d7f74762f" 2025-12-04T09:45:05.7876328Z }, 2025-12-04T09:45:05.7876460Z { 2025-12-04T09:45:05.7876683Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7876958Z "size": 32, 2025-12-04T09:45:05.7877236Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7877566Z }, 2025-12-04T09:45:05.7877693Z { 2025-12-04T09:45:05.7877914Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7878199Z "size": 188, 2025-12-04T09:45:05.7878477Z "digest": "sha256:3bcfa090e4efd3677425f76baea9f1e0c50a75d8c6b5713ec05310f1dff24539" 2025-12-04T09:45:05.7878791Z }, 2025-12-04T09:45:05.7878920Z { 2025-12-04T09:45:05.7879139Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7879434Z "size": 1370, 2025-12-04T09:45:05.7879719Z "digest": "sha256:eb0504ec4d9218a79896b604f73dc0ea5a0f96266ad9c2cdbbbe5f0f18222694" 2025-12-04T09:45:05.7880041Z }, 2025-12-04T09:45:05.7880163Z { 2025-12-04T09:45:05.7880379Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7880656Z "size": 32, 
2025-12-04T09:45:05.7880927Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7881250Z }, 2025-12-04T09:45:05.7881381Z { 2025-12-04T09:45:05.7881593Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7881870Z "size": 136, 2025-12-04T09:45:05.7882154Z "digest": "sha256:15d0fec09d7b196a1462d51516ee90fc3443ba178d3e56d59cacf32146b4321d" 2025-12-04T09:45:05.7882483Z }, 2025-12-04T09:45:05.7882606Z { 2025-12-04T09:45:05.7882823Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7883106Z "size": 528, 2025-12-04T09:45:05.7883379Z "digest": "sha256:cca81fcc62a949959ca4dd3c9056fb293d548ef8607127eeeef6cfd3a8897ca8" 2025-12-04T09:45:05.7883719Z }, 2025-12-04T09:45:05.7883853Z { 2025-12-04T09:45:05.7884064Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7884343Z "size": 32, 2025-12-04T09:45:05.7884624Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7884948Z }, 2025-12-04T09:45:05.7885077Z { 2025-12-04T09:45:05.7885296Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7885570Z "size": 104, 2025-12-04T09:45:05.7885856Z "digest": "sha256:b0b8f9b5c6ab98db9cd830dc584e1b6aec9add139e4cc48d8c243d36691e25b4" 2025-12-04T09:45:05.7886181Z }, 2025-12-04T09:45:05.7886311Z { 2025-12-04T09:45:05.7886524Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7886804Z "size": 435, 2025-12-04T09:45:05.7887076Z "digest": "sha256:0606ca4d47a8a70e91e92b03ca51a85e731641b09342136a54ef2f2a6d9dfb44" 2025-12-04T09:45:05.7887399Z }, 2025-12-04T09:45:05.7887532Z { 2025-12-04T09:45:05.7887848Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7888130Z "size": 32, 2025-12-04T09:45:05.7888411Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 
2025-12-04T09:45:05.7888788Z }, 2025-12-04T09:45:05.7888934Z { 2025-12-04T09:45:05.7889196Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7889530Z "size": 109, 2025-12-04T09:45:05.7889956Z "digest": "sha256:2f80a4e1b3b95ed67bb781ea787e8a63e46de79117d9d8e65c257072b38afa2d" 2025-12-04T09:45:05.7890333Z }, 2025-12-04T09:45:05.7890474Z { 2025-12-04T09:45:05.7890693Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7890964Z "size": 1896, 2025-12-04T09:45:05.7891243Z "digest": "sha256:35c916fb1bd057e517dcab78c3a2a018e68096d8993892ad84f47562d37ae352" 2025-12-04T09:45:05.7891553Z }, 2025-12-04T09:45:05.7891674Z { 2025-12-04T09:45:05.7891888Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7892187Z "size": 197526165, 2025-12-04T09:45:05.7892469Z "digest": "sha256:195537b7dafc96192f768323b1a8cc2a914d41959849b73198579576b0872a44" 2025-12-04T09:45:05.7892784Z }, 2025-12-04T09:45:05.7892914Z { 2025-12-04T09:45:05.7893129Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7893405Z "size": 106, 2025-12-04T09:45:05.7893677Z "digest": "sha256:dc454fd3967e5735b2498b7f1d958a2c626987d5e4ce225ca98da3cd945b59f3" 2025-12-04T09:45:05.7893999Z }, 2025-12-04T09:45:05.7894132Z { 2025-12-04T09:45:05.7894349Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7894628Z "size": 165, 2025-12-04T09:45:05.7894893Z "digest": "sha256:701b34f115fa897181c046dc37288e87cbc3ad74c36a9e2224b5bfe7c5703afb" 2025-12-04T09:45:05.7895216Z }, 2025-12-04T09:45:05.7895345Z { 2025-12-04T09:45:05.7895561Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7895856Z "size": 7944, 2025-12-04T09:45:05.7896143Z "digest": "sha256:39cefc00ffedebc9098261c798408b87a20c95a88fccb110594077f48dadf760" 2025-12-04T09:45:05.7896455Z }, 2025-12-04T09:45:05.7896583Z { 2025-12-04T09:45:05.7896812Z "mediaType": 
"application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7897093Z "size": 8071, 2025-12-04T09:45:05.7897371Z "digest": "sha256:6ae51eb61a325b2c2995a5088c81aa20821b75be65b5aa722c7c40556b5d03ea" 2025-12-04T09:45:05.7897691Z }, 2025-12-04T09:45:05.7897834Z { 2025-12-04T09:45:05.7898052Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7898336Z "size": 304, 2025-12-04T09:45:05.7898617Z "digest": "sha256:1fd5341e66dfc0c1ae23af014641a92a6fd02640c528fe6d4dc55921ed659a26" 2025-12-04T09:45:05.7898934Z }, 2025-12-04T09:45:05.7899069Z { 2025-12-04T09:45:05.7899299Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7899573Z "size": 13364291, 2025-12-04T09:45:05.7899868Z "digest": "sha256:72a7c87e35e40ab796f90aee1b51add7902f0cdc44406d2505b6c6a1f55a8da6" 2025-12-04T09:45:05.7900188Z }, 2025-12-04T09:45:05.7900311Z { 2025-12-04T09:45:05.7900531Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7900824Z "size": 108, 2025-12-04T09:45:05.7901104Z "digest": "sha256:ec36862ac98ebaac52ee1a8b1d162d45bd0e3bf59ae7e19c8f80ad3960b4c600" 2025-12-04T09:45:05.7901420Z }, 2025-12-04T09:45:05.7901544Z { 2025-12-04T09:45:05.7901761Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7902031Z "size": 54145699, 2025-12-04T09:45:05.7902313Z "digest": "sha256:05ddbf246e8add0e293474dbf88bb028d5a295a25ac59e8648a18db644377773" 2025-12-04T09:45:05.7902629Z }, 2025-12-04T09:45:05.7902758Z { 2025-12-04T09:45:05.7902972Z "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip", 2025-12-04T09:45:05.7903244Z "size": 32, 2025-12-04T09:45:05.7903516Z "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1" 2025-12-04T09:45:05.7903932Z } 2025-12-04T09:45:05.7904059Z ] 2025-12-04T09:45:05.7904182Z } 2025-12-04T09:45:05.7904321Z + exit 0 2025-12-04T09:45:05.7926936Z ##[group]Run set -eux 2025-12-04T09:45:05.7927136Z set 
-eux 2025-12-04T09:45:05.7927417Z # It's ok if this steps fails, it would then be an anonymous user like what we used to have 2025-12-04T09:45:05.7928334Z aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token | jq --raw-output '.SecretString' | jq -r .docker_hub_readonly_token | docker login --username pytorchbot --password-stdin || true 2025-12-04T09:45:05.7936570Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:05.7936837Z env: 2025-12-04T09:45:05.7936987Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:05.7937172Z ##[endgroup] 2025-12-04T09:45:05.7968647Z + aws secretsmanager get-secret-value --secret-id docker_hub_readonly_token 2025-12-04T09:45:05.7969495Z + jq --raw-output .SecretString 2025-12-04T09:45:05.7970988Z + jq -r .docker_hub_readonly_token 2025-12-04T09:45:05.7972044Z + docker login --username pytorchbot --password-stdin 2025-12-04T09:45:06.3308179Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:45:06.3309378Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:45:06.3310648Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:45:06.3311366Z 2025-12-04T09:45:06.3311566Z Login Succeeded 2025-12-04T09:45:06.3391972Z ##[group]Run tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T09:45:06.3392356Z tag=${ECR_DOCKER_IMAGE##*:} 2025-12-04T09:45:06.3392819Z echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}" 2025-12-04T09:45:06.3400795Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:06.3401069Z env: 2025-12-04T09:45:06.3401224Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:06.3401819Z ECR_DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:06.3402446Z ##[endgroup] 2025-12-04T09:45:06.3431499Z docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:06.3470680Z ##[group]Run pytorch/test-infra/.github/actions/pull-docker-image@main 2025-12-04T09:45:06.3471012Z with: 2025-12-04T09:45:06.3471581Z docker-image: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:06.3472261Z docker-registry: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:06.3472542Z env: 2025-12-04T09:45:06.3472692Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:06.3472873Z ##[endgroup] 2025-12-04T09:45:06.3486101Z ##[group]Run set -x 2025-12-04T09:45:06.3486297Z set -x 2025-12-04T09:45:06.3486474Z set +e 2025-12-04T09:45:06.3486635Z  2025-12-04T09:45:06.3486781Z login() { 2025-12-04T09:45:06.3487116Z  aws ecr get-login-password --region us-east-1 | docker login -u AWS --password-stdin "$1" 2025-12-04T09:45:06.3487487Z } 2025-12-04T09:45:06.3487643Z  2025-12-04T09:45:06.3487816Z retry () { 2025-12-04T09:45:06.3488000Z  $* || (sleep 1 && $*) || (sleep 2 && $*) 
2025-12-04T09:45:06.3488222Z } 2025-12-04T09:45:06.3488370Z  2025-12-04T09:45:06.3488526Z retry login "${DOCKER_REGISTRY}" 2025-12-04T09:45:06.3488730Z  2025-12-04T09:45:06.3489068Z IMAGE_SIZE=$(docker manifest inspect "${DOCKER_IMAGE}" | jq '[.layers[].size, .config.size] | add / 1024 / 1024') 2025-12-04T09:45:06.3489528Z echo "Compressed size of image in MB: ${IMAGE_SIZE}" 2025-12-04T09:45:06.3489785Z  2025-12-04T09:45:06.3489929Z set -e 2025-12-04T09:45:06.3490343Z # ignore output since only exit code is used for conditional 2025-12-04T09:45:06.3490685Z # only pull docker image if it's not available locally 2025-12-04T09:45:06.3491058Z if ! docker inspect --type=image "${DOCKER_IMAGE}" >/dev/null 2>/dev/null; then 2025-12-04T09:45:06.3491413Z  retry docker pull "${DOCKER_IMAGE}" 2025-12-04T09:45:06.3491652Z fi 2025-12-04T09:45:06.3498630Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:45:06.3498896Z env: 2025-12-04T09:45:06.3499050Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:45:06.3499643Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:06.3500318Z DOCKER_REGISTRY: 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:06.3500595Z ##[endgroup] 2025-12-04T09:45:06.3525646Z + set +e 2025-12-04T09:45:06.3526102Z + retry login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:06.3526545Z + login 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:06.3529179Z + aws ecr get-login-password --region us-east-1 2025-12-04T09:45:06.3530322Z + docker login -u AWS --password-stdin 308535385114.dkr.ecr.us-east-1.amazonaws.com 2025-12-04T09:45:06.8241356Z WARNING! Your password will be stored unencrypted in /home/ec2-user/.docker/config.json. 2025-12-04T09:45:06.8242805Z Configure a credential helper to remove this warning. 
See 2025-12-04T09:45:06.8243757Z https://docs.docker.com/engine/reference/commandline/login/#credentials-store 2025-12-04T09:45:06.8244339Z 2025-12-04T09:45:06.8244640Z Login Succeeded 2025-12-04T09:45:06.8268881Z ++ docker manifest inspect 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:06.8270209Z ++ jq '[.layers[].size, .config.size] | add / 1024 / 1024' 2025-12-04T09:45:07.0277145Z + IMAGE_SIZE=15091.581844329834 2025-12-04T09:45:07.0277706Z + echo 'Compressed size of image in MB: 15091.581844329834' 2025-12-04T09:45:07.0278241Z + set -e 2025-12-04T09:45:07.0279351Z + docker inspect --type=image 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:07.0280293Z Compressed size of image in MB: 15091.581844329834 2025-12-04T09:45:07.0407038Z + retry docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:07.0408350Z + docker pull 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:45:07.2943578Z pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a: Pulling from pytorch/ci-image 2025-12-04T09:45:07.2946489Z 63e5bc7682b8: Pulling fs layer 2025-12-04T09:45:07.2946916Z 0678d56345c9: Pulling fs layer 2025-12-04T09:45:07.2947391Z 45f5c9ddfce7: Pulling fs layer 2025-12-04T09:45:07.2947706Z 086b1df51ac1: Pulling fs layer 2025-12-04T09:45:07.2948020Z fe8a7b64bf98: Pulling fs layer 2025-12-04T09:45:07.2948319Z 7680723e9a57: Pulling fs layer 2025-12-04T09:45:07.2948630Z 9c5027aeeb4e: Pulling fs layer 2025-12-04T09:45:07.2948939Z 9a5652110360: Pulling fs layer 2025-12-04T09:45:07.2949235Z 
375c4427e914: Pulling fs layer 2025-12-04T09:45:07.2949538Z a86faaa7dbdd: Pulling fs layer 2025-12-04T09:45:07.2949853Z fb7848686804: Pulling fs layer 2025-12-04T09:45:07.2950169Z 3541df015cdb: Pulling fs layer 2025-12-04T09:45:07.2950459Z 79dc80f426b2: Pulling fs layer 2025-12-04T09:45:07.2950668Z a13fcc1b90bb: Pulling fs layer 2025-12-04T09:45:07.2950872Z 4f4fb700ef54: Pulling fs layer 2025-12-04T09:45:07.2951053Z 549db4d6c618: Pulling fs layer 2025-12-04T09:45:07.2951237Z 5c63528cb580: Pulling fs layer 2025-12-04T09:45:07.2951420Z 75bd83b989a4: Pulling fs layer 2025-12-04T09:45:07.2951915Z de6e78970f51: Pulling fs layer 2025-12-04T09:45:07.2952269Z e13ed7c7e473: Pulling fs layer 2025-12-04T09:45:07.2952559Z 6e2949bcb741: Pulling fs layer 2025-12-04T09:45:07.2952893Z 14d69d9aaec7: Pulling fs layer 2025-12-04T09:45:07.2953236Z 5c02769dd8e5: Pulling fs layer 2025-12-04T09:45:07.2953606Z 35041ce524ac: Pulling fs layer 2025-12-04T09:45:07.2953941Z 2fa92dc5885e: Pulling fs layer 2025-12-04T09:45:07.2954283Z 2b85eafbd92a: Pulling fs layer 2025-12-04T09:45:07.2954620Z ff755a4ddad7: Pulling fs layer 2025-12-04T09:45:07.2954957Z 09eb41bdf42d: Pulling fs layer 2025-12-04T09:45:07.2955486Z 11ede4d59e93: Pulling fs layer 2025-12-04T09:45:07.2955849Z 1283cd8f801a: Pulling fs layer 2025-12-04T09:45:07.2956183Z 024fa855425f: Pulling fs layer 2025-12-04T09:45:07.2956519Z 303e6747a62e: Pulling fs layer 2025-12-04T09:45:07.2956876Z 3017cdf4838b: Pulling fs layer 2025-12-04T09:45:07.2957233Z 6b6cd1c358e8: Pulling fs layer 2025-12-04T09:45:07.2957585Z b2dd04501124: Pulling fs layer 2025-12-04T09:45:07.2957798Z 55adc51fe589: Pulling fs layer 2025-12-04T09:45:07.2957996Z a43ca0e4b837: Pulling fs layer 2025-12-04T09:45:07.2958312Z b7212f17fd14: Pulling fs layer 2025-12-04T09:45:07.2958493Z 083e42cac090: Pulling fs layer 2025-12-04T09:45:07.2958693Z 0a00b784a4aa: Pulling fs layer 2025-12-04T09:45:07.2958877Z c6173c779f7b: Pulling fs layer 2025-12-04T09:45:07.2959069Z ed3d1e3387b9: 
Pulling fs layer 2025-12-04T09:45:07.2959266Z b29343478586: Pulling fs layer 2025-12-04T09:45:07.2959457Z c6f0520487fb: Pulling fs layer 2025-12-04T09:45:07.2959631Z 148171691cd4: Pulling fs layer 2025-12-04T09:45:07.2959812Z 2c666d30ed77: Pulling fs layer 2025-12-04T09:45:07.2959993Z 5d8d3a0a98e0: Pulling fs layer 2025-12-04T09:45:07.2960180Z b06bafce9e81: Pulling fs layer 2025-12-04T09:45:07.2960356Z 15e0d7e4590d: Pulling fs layer 2025-12-04T09:45:07.2960542Z a514bd1add31: Pulling fs layer 2025-12-04T09:45:07.2960728Z 57b84ee60002: Pulling fs layer 2025-12-04T09:45:07.2960907Z b8babeff6d81: Pulling fs layer 2025-12-04T09:45:07.2961106Z 83779ddf6a85: Pulling fs layer 2025-12-04T09:45:07.2961291Z 8b7620c0d736: Pulling fs layer 2025-12-04T09:45:07.2961469Z 3bcfa090e4ef: Pulling fs layer 2025-12-04T09:45:07.2961655Z eb0504ec4d92: Pulling fs layer 2025-12-04T09:45:07.2961885Z 15d0fec09d7b: Pulling fs layer 2025-12-04T09:45:07.2962144Z cca81fcc62a9: Pulling fs layer 2025-12-04T09:45:07.2962612Z b0b8f9b5c6ab: Pulling fs layer 2025-12-04T09:45:07.2962944Z 0606ca4d47a8: Pulling fs layer 2025-12-04T09:45:07.2963134Z 2f80a4e1b3b9: Pulling fs layer 2025-12-04T09:45:07.2963320Z 35c916fb1bd0: Pulling fs layer 2025-12-04T09:45:07.2963538Z 195537b7dafc: Pulling fs layer 2025-12-04T09:45:07.2963834Z dc454fd3967e: Pulling fs layer 2025-12-04T09:45:07.2964030Z 701b34f115fa: Pulling fs layer 2025-12-04T09:45:07.2964216Z 39cefc00ffed: Pulling fs layer 2025-12-04T09:45:07.2964401Z 6ae51eb61a32: Pulling fs layer 2025-12-04T09:45:07.2964583Z 1fd5341e66df: Pulling fs layer 2025-12-04T09:45:07.2964758Z 72a7c87e35e4: Pulling fs layer 2025-12-04T09:45:07.2964947Z fe8a7b64bf98: Waiting 2025-12-04T09:45:07.2965122Z ec36862ac98e: Pulling fs layer 2025-12-04T09:45:07.2965299Z 05ddbf246e8a: Pulling fs layer 2025-12-04T09:45:07.2965472Z de6e78970f51: Waiting 2025-12-04T09:45:07.2965627Z b2dd04501124: Waiting 2025-12-04T09:45:07.2965779Z e13ed7c7e473: Waiting 2025-12-04T09:45:07.2965940Z 
375c4427e914: Waiting 2025-12-04T09:45:07.2966095Z 9c5027aeeb4e: Waiting 2025-12-04T09:45:07.2966244Z 11ede4d59e93: Waiting 2025-12-04T09:45:07.2966397Z 6e2949bcb741: Waiting 2025-12-04T09:45:07.2966558Z 7680723e9a57: Waiting 2025-12-04T09:45:07.2966708Z 9a5652110360: Waiting 2025-12-04T09:45:07.2966862Z 14d69d9aaec7: Waiting 2025-12-04T09:45:07.2967022Z b8babeff6d81: Waiting 2025-12-04T09:45:07.2967172Z 55adc51fe589: Waiting 2025-12-04T09:45:07.2967327Z 79dc80f426b2: Waiting 2025-12-04T09:45:07.2967483Z a86faaa7dbdd: Waiting 2025-12-04T09:45:07.2967633Z 3541df015cdb: Waiting 2025-12-04T09:45:07.2967792Z 2fa92dc5885e: Waiting 2025-12-04T09:45:07.2967948Z a13fcc1b90bb: Waiting 2025-12-04T09:45:07.2968269Z a43ca0e4b837: Waiting 2025-12-04T09:45:07.2968428Z 5c63528cb580: Waiting 2025-12-04T09:45:07.2968584Z 4f4fb700ef54: Waiting 2025-12-04T09:45:07.2968739Z 83779ddf6a85: Waiting 2025-12-04T09:45:07.2968892Z 086b1df51ac1: Waiting 2025-12-04T09:45:07.2969058Z 2b85eafbd92a: Waiting 2025-12-04T09:45:07.2969217Z 549db4d6c618: Waiting 2025-12-04T09:45:07.2969378Z 75bd83b989a4: Waiting 2025-12-04T09:45:07.2969538Z b7212f17fd14: Waiting 2025-12-04T09:45:07.2969697Z ff755a4ddad7: Waiting 2025-12-04T09:45:07.2969850Z 2f80a4e1b3b9: Waiting 2025-12-04T09:45:07.2970008Z 303e6747a62e: Waiting 2025-12-04T09:45:07.2970166Z 35c916fb1bd0: Waiting 2025-12-04T09:45:07.2970323Z 1fd5341e66df: Waiting 2025-12-04T09:45:07.2970481Z 195537b7dafc: Waiting 2025-12-04T09:45:07.2970649Z dc454fd3967e: Waiting 2025-12-04T09:45:07.2970803Z b06bafce9e81: Waiting 2025-12-04T09:45:07.2970963Z 72a7c87e35e4: Waiting 2025-12-04T09:45:07.2971119Z ec36862ac98e: Waiting 2025-12-04T09:45:07.2971267Z 05ddbf246e8a: Waiting 2025-12-04T09:45:07.2971427Z 701b34f115fa: Waiting 2025-12-04T09:45:07.2971584Z 39cefc00ffed: Waiting 2025-12-04T09:45:07.2971732Z 15e0d7e4590d: Waiting 2025-12-04T09:45:07.2971887Z 3bcfa090e4ef: Waiting 2025-12-04T09:45:07.2972043Z ed3d1e3387b9: Waiting 2025-12-04T09:45:07.2972190Z 
6ae51eb61a32: Waiting 2025-12-04T09:45:07.2972344Z b29343478586: Waiting 2025-12-04T09:45:07.2972498Z c6f0520487fb: Waiting 2025-12-04T09:45:07.2972651Z eb0504ec4d92: Waiting 2025-12-04T09:45:07.2972798Z 3017cdf4838b: Waiting 2025-12-04T09:45:07.2972964Z 15d0fec09d7b: Waiting 2025-12-04T09:45:07.2973122Z cca81fcc62a9: Waiting 2025-12-04T09:45:07.2973271Z 0606ca4d47a8: Waiting 2025-12-04T09:45:07.2973423Z 148171691cd4: Waiting 2025-12-04T09:45:07.2973574Z 2c666d30ed77: Waiting 2025-12-04T09:45:07.2973724Z b0b8f9b5c6ab: Waiting 2025-12-04T09:45:07.2973881Z a514bd1add31: Waiting 2025-12-04T09:45:07.2974034Z 35041ce524ac: Waiting 2025-12-04T09:45:07.2974180Z 57b84ee60002: Waiting 2025-12-04T09:45:07.2974332Z 0a00b784a4aa: Waiting 2025-12-04T09:45:07.2974490Z c6173c779f7b: Waiting 2025-12-04T09:45:07.2974639Z 024fa855425f: Waiting 2025-12-04T09:45:07.2974792Z 6b6cd1c358e8: Waiting 2025-12-04T09:45:07.2974946Z 5c02769dd8e5: Waiting 2025-12-04T09:45:07.2975092Z 5d8d3a0a98e0: Waiting 2025-12-04T09:45:07.2975249Z 1283cd8f801a: Waiting 2025-12-04T09:45:07.2975406Z 8b7620c0d736: Waiting 2025-12-04T09:45:07.2975655Z 083e42cac090: Waiting 2025-12-04T09:45:07.2975818Z 09eb41bdf42d: Waiting 2025-12-04T09:45:07.2975995Z fb7848686804: Waiting 2025-12-04T09:45:07.3908276Z 0678d56345c9: Download complete 2025-12-04T09:45:07.4913496Z 086b1df51ac1: Verifying Checksum 2025-12-04T09:45:07.4913823Z 086b1df51ac1: Download complete 2025-12-04T09:45:07.5774055Z fe8a7b64bf98: Verifying Checksum 2025-12-04T09:45:07.5774379Z fe8a7b64bf98: Download complete 2025-12-04T09:45:07.6417889Z 63e5bc7682b8: Verifying Checksum 2025-12-04T09:45:07.6418173Z 63e5bc7682b8: Download complete 2025-12-04T09:45:07.6737674Z 7680723e9a57: Verifying Checksum 2025-12-04T09:45:07.6738089Z 7680723e9a57: Download complete 2025-12-04T09:45:07.7518610Z 9c5027aeeb4e: Verifying Checksum 2025-12-04T09:45:07.7518958Z 9c5027aeeb4e: Download complete 2025-12-04T09:45:07.7712604Z 9a5652110360: Verifying Checksum 
2025-12-04T09:45:07.7712936Z 9a5652110360: Download complete 2025-12-04T09:45:07.8614821Z a86faaa7dbdd: Verifying Checksum 2025-12-04T09:45:07.8615470Z a86faaa7dbdd: Download complete 2025-12-04T09:45:07.9586491Z fb7848686804: Verifying Checksum 2025-12-04T09:45:08.0592647Z 3541df015cdb: Verifying Checksum 2025-12-04T09:45:08.0592965Z 3541df015cdb: Download complete 2025-12-04T09:45:08.1637618Z 79dc80f426b2: Verifying Checksum 2025-12-04T09:45:08.1637971Z 79dc80f426b2: Download complete 2025-12-04T09:45:08.5489117Z 63e5bc7682b8: Pull complete 2025-12-04T09:45:08.5720344Z 0678d56345c9: Pull complete 2025-12-04T09:45:08.9283269Z 375c4427e914: Verifying Checksum 2025-12-04T09:45:08.9283641Z 375c4427e914: Download complete 2025-12-04T09:45:08.9369525Z 4f4fb700ef54: Verifying Checksum 2025-12-04T09:45:08.9370054Z 4f4fb700ef54: Download complete 2025-12-04T09:45:09.0512438Z 549db4d6c618: Download complete 2025-12-04T09:45:09.1508955Z 5c63528cb580: Download complete 2025-12-04T09:45:09.2419763Z 75bd83b989a4: Verifying Checksum 2025-12-04T09:45:09.2420189Z 75bd83b989a4: Download complete 2025-12-04T09:45:09.3469438Z de6e78970f51: Verifying Checksum 2025-12-04T09:45:09.3469769Z de6e78970f51: Download complete 2025-12-04T09:45:09.4432202Z e13ed7c7e473: Verifying Checksum 2025-12-04T09:45:09.4432524Z e13ed7c7e473: Download complete 2025-12-04T09:45:09.5255033Z 6e2949bcb741: Verifying Checksum 2025-12-04T09:45:09.5255710Z 6e2949bcb741: Download complete 2025-12-04T09:45:09.5985427Z 14d69d9aaec7: Verifying Checksum 2025-12-04T09:45:09.5985731Z 14d69d9aaec7: Download complete 2025-12-04T09:45:09.6911900Z 5c02769dd8e5: Verifying Checksum 2025-12-04T09:45:09.6912203Z 5c02769dd8e5: Download complete 2025-12-04T09:45:10.4713429Z 45f5c9ddfce7: Verifying Checksum 2025-12-04T09:45:10.4713733Z 45f5c9ddfce7: Download complete 2025-12-04T09:45:10.5428767Z 2fa92dc5885e: Verifying Checksum 2025-12-04T09:45:10.5429078Z 2fa92dc5885e: Download complete 2025-12-04T09:45:10.9399600Z 
2b85eafbd92a: Verifying Checksum 2025-12-04T09:45:10.9399928Z 2b85eafbd92a: Download complete 2025-12-04T09:45:11.0068189Z ff755a4ddad7: Download complete 2025-12-04T09:45:11.1036602Z 09eb41bdf42d: Verifying Checksum 2025-12-04T09:45:11.1036905Z 09eb41bdf42d: Download complete 2025-12-04T09:45:15.7624392Z 11ede4d59e93: Verifying Checksum 2025-12-04T09:45:15.7624907Z 11ede4d59e93: Download complete 2025-12-04T09:45:15.8371495Z 1283cd8f801a: Download complete 2025-12-04T09:45:15.9215406Z 024fa855425f: Verifying Checksum 2025-12-04T09:45:15.9215753Z 024fa855425f: Download complete 2025-12-04T09:45:15.9959325Z 303e6747a62e: Verifying Checksum 2025-12-04T09:45:15.9959606Z 303e6747a62e: Download complete 2025-12-04T09:45:16.0925993Z 3017cdf4838b: Download complete 2025-12-04T09:45:16.3490228Z 6b6cd1c358e8: Verifying Checksum 2025-12-04T09:45:16.4253341Z b2dd04501124: Verifying Checksum 2025-12-04T09:45:16.4253852Z b2dd04501124: Download complete 2025-12-04T09:45:16.5108630Z 55adc51fe589: Download complete 2025-12-04T09:45:16.6068694Z a43ca0e4b837: Verifying Checksum 2025-12-04T09:45:16.6071354Z a43ca0e4b837: Download complete 2025-12-04T09:45:16.6987050Z b7212f17fd14: Verifying Checksum 2025-12-04T09:45:16.6987762Z b7212f17fd14: Download complete 2025-12-04T09:45:16.7857438Z 083e42cac090: Verifying Checksum 2025-12-04T09:45:16.7857757Z 083e42cac090: Download complete 2025-12-04T09:45:16.8898864Z 0a00b784a4aa: Download complete 2025-12-04T09:45:16.9826584Z c6173c779f7b: Verifying Checksum 2025-12-04T09:45:16.9827041Z c6173c779f7b: Download complete 2025-12-04T09:45:17.5569739Z 45f5c9ddfce7: Pull complete 2025-12-04T09:45:17.5794610Z 086b1df51ac1: Pull complete 2025-12-04T09:45:17.6010205Z fe8a7b64bf98: Pull complete 2025-12-04T09:45:17.6227649Z 7680723e9a57: Pull complete 2025-12-04T09:45:17.6465003Z 9c5027aeeb4e: Pull complete 2025-12-04T09:45:17.6702686Z 9a5652110360: Pull complete 2025-12-04T09:45:18.4859292Z ed3d1e3387b9: Verifying Checksum 
2025-12-04T09:45:18.4859592Z ed3d1e3387b9: Download complete 2025-12-04T09:45:18.5889202Z b29343478586: Verifying Checksum 2025-12-04T09:45:18.5889502Z b29343478586: Download complete 2025-12-04T09:45:19.5965154Z 375c4427e914: Pull complete 2025-12-04T09:45:19.7754220Z a86faaa7dbdd: Pull complete 2025-12-04T09:45:19.9958317Z fb7848686804: Pull complete 2025-12-04T09:45:20.1656914Z 3541df015cdb: Pull complete 2025-12-04T09:45:20.2620935Z 79dc80f426b2: Pull complete 2025-12-04T09:45:21.7652779Z c6f0520487fb: Verifying Checksum 2025-12-04T09:45:21.7653081Z c6f0520487fb: Download complete 2025-12-04T09:45:40.0959815Z a13fcc1b90bb: Verifying Checksum 2025-12-04T09:45:40.0960196Z a13fcc1b90bb: Download complete 2025-12-04T09:45:40.1746735Z 2c666d30ed77: Verifying Checksum 2025-12-04T09:45:40.1747022Z 2c666d30ed77: Download complete 2025-12-04T09:45:40.2734070Z 5d8d3a0a98e0: Verifying Checksum 2025-12-04T09:45:40.2734358Z 5d8d3a0a98e0: Download complete 2025-12-04T09:45:40.3805728Z b06bafce9e81: Download complete 2025-12-04T09:45:40.4611198Z 15e0d7e4590d: Verifying Checksum 2025-12-04T09:45:40.4611532Z 15e0d7e4590d: Download complete 2025-12-04T09:45:40.5316357Z a514bd1add31: Verifying Checksum 2025-12-04T09:45:40.5316684Z a514bd1add31: Download complete 2025-12-04T09:45:40.6099624Z 57b84ee60002: Verifying Checksum 2025-12-04T09:45:40.6099944Z 57b84ee60002: Download complete 2025-12-04T09:45:40.7041629Z b8babeff6d81: Verifying Checksum 2025-12-04T09:45:40.7042000Z b8babeff6d81: Download complete 2025-12-04T09:45:40.7962291Z 83779ddf6a85: Verifying Checksum 2025-12-04T09:45:40.7962653Z 83779ddf6a85: Download complete 2025-12-04T09:45:40.8837165Z 8b7620c0d736: Download complete 2025-12-04T09:45:40.9615719Z 3bcfa090e4ef: Verifying Checksum 2025-12-04T09:45:40.9616146Z 3bcfa090e4ef: Download complete 2025-12-04T09:45:41.2665162Z eb0504ec4d92: Verifying Checksum 2025-12-04T09:45:41.2665557Z eb0504ec4d92: Download complete 2025-12-04T09:45:41.3266586Z 15d0fec09d7b: Verifying 
Checksum 2025-12-04T09:45:41.3267156Z 15d0fec09d7b: Download complete 2025-12-04T09:45:41.3976504Z cca81fcc62a9: Verifying Checksum 2025-12-04T09:45:41.3976995Z cca81fcc62a9: Download complete 2025-12-04T09:45:41.4934569Z b0b8f9b5c6ab: Download complete 2025-12-04T09:45:41.5698577Z 0606ca4d47a8: Verifying Checksum 2025-12-04T09:45:41.5700633Z 0606ca4d47a8: Download complete 2025-12-04T09:45:41.6668400Z 2f80a4e1b3b9: Download complete 2025-12-04T09:45:41.7392728Z 35c916fb1bd0: Verifying Checksum 2025-12-04T09:45:41.7393090Z 35c916fb1bd0: Download complete 2025-12-04T09:45:43.7702717Z 195537b7dafc: Verifying Checksum 2025-12-04T09:45:43.7703200Z 195537b7dafc: Download complete 2025-12-04T09:45:43.8519530Z dc454fd3967e: Verifying Checksum 2025-12-04T09:45:43.8519841Z dc454fd3967e: Download complete 2025-12-04T09:45:43.9234808Z 701b34f115fa: Download complete 2025-12-04T09:45:44.0092427Z 39cefc00ffed: Verifying Checksum 2025-12-04T09:45:44.0092892Z 39cefc00ffed: Download complete 2025-12-04T09:45:44.0862778Z 6ae51eb61a32: Download complete 2025-12-04T09:45:44.1599125Z 1fd5341e66df: Verifying Checksum 2025-12-04T09:45:44.1599671Z 1fd5341e66df: Download complete 2025-12-04T09:45:44.3319258Z 72a7c87e35e4: Verifying Checksum 2025-12-04T09:45:44.3319616Z 72a7c87e35e4: Download complete 2025-12-04T09:45:44.4317198Z ec36862ac98e: Verifying Checksum 2025-12-04T09:45:44.4317724Z ec36862ac98e: Download complete 2025-12-04T09:45:45.0192455Z 05ddbf246e8a: Verifying Checksum 2025-12-04T09:45:45.0192750Z 05ddbf246e8a: Download complete 2025-12-04T09:45:52.3963055Z 148171691cd4: Verifying Checksum 2025-12-04T09:45:52.3963566Z 148171691cd4: Download complete 2025-12-04T09:46:30.2284283Z 35041ce524ac: Verifying Checksum 2025-12-04T09:46:30.2284659Z 35041ce524ac: Download complete 2025-12-04T09:47:01.7894473Z a13fcc1b90bb: Pull complete 2025-12-04T09:47:01.9258935Z 4f4fb700ef54: Pull complete 2025-12-04T09:47:02.0507691Z 549db4d6c618: Pull complete 2025-12-04T09:47:02.2060468Z 
5c63528cb580: Pull complete 2025-12-04T09:47:02.4120307Z 75bd83b989a4: Pull complete 2025-12-04T09:47:02.6798747Z de6e78970f51: Pull complete 2025-12-04T09:47:02.8787737Z e13ed7c7e473: Pull complete 2025-12-04T09:47:03.1080792Z 6e2949bcb741: Pull complete 2025-12-04T09:47:03.3109223Z 14d69d9aaec7: Pull complete 2025-12-04T09:47:03.5151613Z 5c02769dd8e5: Pull complete 2025-12-04T09:48:36.2502194Z 35041ce524ac: Pull complete 2025-12-04T09:48:36.4691257Z 2fa92dc5885e: Pull complete 2025-12-04T09:48:37.0419041Z 2b85eafbd92a: Pull complete 2025-12-04T09:48:37.2471526Z ff755a4ddad7: Pull complete 2025-12-04T09:48:37.2731171Z 09eb41bdf42d: Pull complete 2025-12-04T09:48:43.8884128Z 11ede4d59e93: Pull complete 2025-12-04T09:48:44.1105726Z 1283cd8f801a: Pull complete 2025-12-04T09:48:44.3186048Z 024fa855425f: Pull complete 2025-12-04T09:48:44.7260034Z 303e6747a62e: Pull complete 2025-12-04T09:48:44.9474276Z 3017cdf4838b: Pull complete 2025-12-04T09:48:45.3052002Z 6b6cd1c358e8: Pull complete 2025-12-04T09:48:45.5098321Z b2dd04501124: Pull complete 2025-12-04T09:48:45.7356991Z 55adc51fe589: Pull complete 2025-12-04T09:48:46.1834149Z a43ca0e4b837: Pull complete 2025-12-04T09:48:46.4062604Z b7212f17fd14: Pull complete 2025-12-04T09:48:46.6174348Z 083e42cac090: Pull complete 2025-12-04T09:48:47.0428359Z 0a00b784a4aa: Pull complete 2025-12-04T09:48:47.2637906Z c6173c779f7b: Pull complete 2025-12-04T09:48:50.0456428Z ed3d1e3387b9: Pull complete 2025-12-04T09:48:50.2610991Z b29343478586: Pull complete 2025-12-04T09:48:51.3080139Z c6f0520487fb: Pull complete 2025-12-04T09:49:35.2054086Z 148171691cd4: Pull complete 2025-12-04T09:49:35.4012292Z 2c666d30ed77: Pull complete 2025-12-04T09:49:35.5817541Z 5d8d3a0a98e0: Pull complete 2025-12-04T09:49:35.8597713Z b06bafce9e81: Pull complete 2025-12-04T09:49:35.9992806Z 15e0d7e4590d: Pull complete 2025-12-04T09:49:36.0695313Z a514bd1add31: Pull complete 2025-12-04T09:49:36.3519040Z 57b84ee60002: Pull complete 2025-12-04T09:49:36.7250388Z 
b8babeff6d81: Pull complete 2025-12-04T09:49:36.9004618Z 83779ddf6a85: Pull complete 2025-12-04T09:49:37.2431380Z 8b7620c0d736: Pull complete 2025-12-04T09:49:37.5557218Z 3bcfa090e4ef: Pull complete 2025-12-04T09:49:37.7689786Z eb0504ec4d92: Pull complete 2025-12-04T09:49:38.1205353Z 15d0fec09d7b: Pull complete 2025-12-04T09:49:38.3122130Z cca81fcc62a9: Pull complete 2025-12-04T09:49:38.6693330Z b0b8f9b5c6ab: Pull complete 2025-12-04T09:49:38.8732343Z 0606ca4d47a8: Pull complete 2025-12-04T09:49:39.2322276Z 2f80a4e1b3b9: Pull complete 2025-12-04T09:49:39.4341064Z 35c916fb1bd0: Pull complete 2025-12-04T09:49:44.7881364Z 195537b7dafc: Pull complete 2025-12-04T09:49:45.0117238Z dc454fd3967e: Pull complete 2025-12-04T09:49:45.2243388Z 701b34f115fa: Pull complete 2025-12-04T09:49:45.4369345Z 39cefc00ffed: Pull complete 2025-12-04T09:49:45.6552723Z 6ae51eb61a32: Pull complete 2025-12-04T09:49:45.8917929Z 1fd5341e66df: Pull complete 2025-12-04T09:49:47.3021759Z 72a7c87e35e4: Pull complete 2025-12-04T09:49:47.5079040Z ec36862ac98e: Pull complete 2025-12-04T09:49:48.7707547Z 05ddbf246e8a: Pull complete 2025-12-04T09:49:49.0269177Z Digest: sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T09:49:49.0604549Z Status: Downloaded newer image for 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:49:49.0824387Z 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:49:49.0878563Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.0879308Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.0889350Z shell: /usr/bin/bash 
--noprofile --norc -e -o pipefail {0} 2025-12-04T09:49:49.0889637Z env: 2025-12-04T09:49:49.0889790Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:49.0889992Z ##[endgroup] 2025-12-04T09:49:49.1068486Z ##[group]Run pytorch/test-infra/.github/actions/setup-nvidia@main 2025-12-04T09:49:49.1068828Z with: 2025-12-04T09:49:49.1068989Z driver-version: 580.82.07 2025-12-04T09:49:49.1069178Z env: 2025-12-04T09:49:49.1069338Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:49.1069519Z ##[endgroup] 2025-12-04T09:49:49.1128455Z ##[group]Run echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.1129102Z echo "IN_CONTAINER_RUNNER=$(if [ -f /.inarc ] || [ -f /.incontainer ]; then echo true ; else echo false; fi)" >> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.1136645Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:49:49.1136916Z env: 2025-12-04T09:49:49.1137074Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:49.1137247Z ##[endgroup] 2025-12-04T09:49:49.1207370Z ##[group]Run set -euo pipefail 2025-12-04T09:49:49.1207646Z set -euo pipefail 2025-12-04T09:49:49.1208092Z  2025-12-04T09:49:49.1208242Z has_gpu=false 2025-12-04T09:49:49.1208433Z devices="" 2025-12-04T09:49:49.1208608Z  2025-12-04T09:49:49.1208810Z if command -v nvidia-smi >/dev/null 2>&1; then 2025-12-04T09:49:49.1209130Z  if nvidia-smi -L >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:49:49.1209405Z  has_gpu=true 2025-12-04T09:49:49.1209614Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:49:49.1209833Z  fi 2025-12-04T09:49:49.1209987Z fi 2025-12-04T09:49:49.1210135Z  2025-12-04T09:49:49.1210286Z if [ "$has_gpu" = false ]; then 2025-12-04T09:49:49.1210556Z  if ls /dev/nvidia* >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:49:49.1210818Z  has_gpu=true 2025-12-04T09:49:49.1211029Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:49:49.1211242Z  fi 2025-12-04T09:49:49.1211398Z fi 2025-12-04T09:49:49.1211548Z  
2025-12-04T09:49:49.1211763Z if [ "$has_gpu" = false ] && command -v lspci >/dev/null 2>&1; then 2025-12-04T09:49:49.1212125Z  if lspci | grep -i 'nvidia' >/tmp/nvidia_devices 2>/dev/null; then 2025-12-04T09:49:49.1212423Z  has_gpu=true 2025-12-04T09:49:49.1212625Z  devices=$(cat /tmp/nvidia_devices) 2025-12-04T09:49:49.1212842Z  fi 2025-12-04T09:49:49.1212992Z fi 2025-12-04T09:49:49.1213128Z  2025-12-04T09:49:49.1213342Z printf 'HAS_NVIDIA=%s\n' "$has_gpu" >> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.1213728Z printf 'DETECTED_DEVICES<> "$GITHUB_OUTPUT" 2025-12-04T09:49:49.1220773Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:49:49.1221031Z env: 2025-12-04T09:49:49.1221197Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:49.1221379Z ##[endgroup] 2025-12-04T09:49:50.7621045Z ##[group]Run if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:49:50.7621369Z if [ "${HAS_NVIDIA}" = "true" ]; then 2025-12-04T09:49:50.7621635Z  echo "HAS_NVIDIA_GPU=true" >> "${GITHUB_ENV}" 2025-12-04T09:49:50.7622007Z  echo "GPU_FLAG=--gpus all -e NVIDIA_DRIVER_CAPABILITIES=all" >> "${GITHUB_ENV}" 2025-12-04T09:49:50.7622326Z else 2025-12-04T09:49:50.7622523Z  echo "HAS_NVIDIA_GPU=false" >> "${GITHUB_ENV}" 2025-12-04T09:49:50.7622759Z fi 2025-12-04T09:49:50.7632164Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:49:50.7632438Z env: 2025-12-04T09:49:50.7632601Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:50.7632787Z HAS_NVIDIA: true 2025-12-04T09:49:50.7632942Z ##[endgroup] 2025-12-04T09:49:50.7718736Z ##[group]Run nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482 2025-12-04T09:49:50.7719034Z with: 2025-12-04T09:49:50.7719180Z timeout_minutes: 10 2025-12-04T09:49:50.7719355Z max_attempts: 3 2025-12-04T09:49:50.7738759Z command: # Is it disgusting to have a full shell script here in this github action? Sure # But is it the best way to make it so that this action relies on nothing else? Absolutely set -eou pipefail DISTRIBUTION=$(. 
/etc/os-release;echo $ID$VERSION_ID) DRIVER_FN="NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run" install_nvidia_docker2_amzn2() { ( set -x # Needed for yum-config-manager sudo yum install -y yum-utils if [[ "${DISTRIBUTION}" == "amzn2023" ]] ; then YUM_REPO_URL="https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo" else # Amazon Linux 2 YUM_REPO_URL="https://nvidia.github.io/nvidia-docker/${DISTRIBUTION}/nvidia-docker.repo" fi sudo yum-config-manager --add-repo "${YUM_REPO_URL}" sudo yum install -y \ nvidia-container-toolkit-1.17.8 \ libnvidia-container-tools-1.17.8 \ libnvidia-container1-1.17.8 \ nvidia-container-toolkit-base-1.17.8 sudo systemctl restart docker ) } install_nvidia_docker2_ubuntu20() { ( set -x # Install nvidia-driver package if not installed status="$(dpkg-query -W --showformat='${db:Status-Status}' nvidia-docker2 2>&1)" if [ ! $? = 0 ] || [ ! "$status" = installed ]; then sudo apt-get install -y nvidia-container-toolkit-1.17.8 sudo systemctl restart docker fi ) } pre_install_nvidia_driver_amzn2() { ( # Purge any nvidia driver installed from RHEL repo sudo yum remove -y nvidia-driver-latest-dkms ) } install_nvidia_driver_common() { ( # Try to gather more information about the runner and its existing NVIDIA driver if any echo "Before installing NVIDIA driver" lspci lsmod modinfo nvidia || true HAS_NVIDIA_DRIVER=0 # Check if NVIDIA driver has already been installed if [ -x "$(command -v nvidia-smi)" ]; then set +e # The driver exists, check its version next. Also check only the first GPU if there are more than one of them # so that the same driver version is not print over multiple lines INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then echo "Failed to get NVIDIA driver version ($INSTALLED_DRIVER_VERSION). 
Continuing" elif [ "$INSTALLED_DRIVER_VERSION" != "$DRIVER_VERSION" ]; then echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has been installed, but we expect to have $DRIVER_VERSION instead. Continuing" # Turn off persistent mode so that the installation script can unload the kernel module sudo killall nvidia-persistenced || true else HAS_NVIDIA_DRIVER=1 echo "NVIDIA driver ($INSTALLED_DRIVER_VERSION) has already been installed. Skipping NVIDIA driver installation" fi set -e fi if [ "$HAS_NVIDIA_DRIVER" -eq 0 ]; then # CAUTION: this may need to be updated in future if [ "${DISTRIBUTION}" != ubuntu20.04 ]; then sudo yum groupinstall -y "Development Tools" # ensure our kernel install is the same as our underlying kernel, # groupinstall "Development Tools" has a habit of mismatching kernel headers sudo yum install -y "kernel-devel-uname-r == $(uname -r)" sudo modprobe backlight fi sudo curl -fsL -o /tmp/nvidia_driver "https://s3.amazonaws.com/ossci-linux/nvidia_driver/$DRIVER_FN" set +e sudo /bin/bash /tmp/nvidia_driver -s --no-drm NVIDIA_INSTALLATION_STATUS=$? RESET_GPU=0 if [ "$NVIDIA_INSTALLATION_STATUS" -ne 0 ]; then sudo cat /var/log/nvidia-installer.log # Fail to install NVIDIA driver, try to reset the GPU RESET_GPU=1 elif [ -x "$(command -v nvidia-smi)" ]; then # Check again if nvidia-smi works even if the driver installation completes successfully INSTALLED_DRIVER_VERSION=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0) NVIDIA_SMI_STATUS=$? if [ "$NVIDIA_SMI_STATUS" -ne 0 ] && [ "$NVIDIA_SMI_STATUS" -ne 14 ]; then RESET_GPU=1 fi fi if [ "$RESET_GPU" -eq 1 ]; then NVIDIA_DEVICES=$(lspci -D | grep -i NVIDIA | cut -d' ' -f1) # The GPU can get stuck in a failure state if somehow the test crashs the GPU microcode. 
When this # happens, we'll try to reset all NVIDIA devices https://github.com/pytorch/pytorch/issues/88388 for PCI_ID in $NVIDIA_DEVICES; do DEVICE_ENABLED=$(cat /sys/bus/pci/devices/$PCI_ID/enable) echo "Reseting $PCI_ID (enabled state: $DEVICE_ENABLED)" # This requires sudo permission of course echo "1" | sudo tee /sys/bus/pci/devices/$PCI_ID/reset sleep 1 done fi sudo rm -fv /tmp/nvidia_driver set -e fi ) } post_install_nvidia_driver_common() { ( sudo modprobe nvidia || true echo "After installing NVIDIA driver" lspci lsmod modinfo nvidia || true ( set +e nvidia-smi # NB: Annoyingly, nvidia-smi command returns successfully with return code 0 even in # the case where the driver has already crashed as it still can get the driver version # and some basic information like the bus ID. However, the rest of the information # would be missing (ERR!), for example: # # +-----------------------------------------------------------------------------+ # | NVIDIA-SMI 525.89.02 Driver Version: 525.89.02 CUDA Version: 12.0 | # |-------------------------------+----------------------+----------------------+ # | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | # | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | # | | | MIG M. | # |===============================+======================+======================| # | 0 ERR! Off | 00000000:00:1E.0 Off | ERR! | # |ERR! ERR! ERR! ERR! / ERR! | 4184MiB / 23028MiB | ERR! Default | # | | | ERR! 
| # +-------------------------------+----------------------+----------------------+ # # +-----------------------------------------------------------------------------+ # | Processes: | # | GPU GI CI PID Type Process name GPU Memory | # | ID ID Usage | # |=============================================================================| # +-----------------------------------------------------------------------------+ # # This should be reported as a failure instead as it will guarantee to fail when # Docker tries to run with --gpus all # # So, the correct check here is to query one of the missing piece of info like # GPU name, so that the command can fail accordingly nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 NVIDIA_SMI_STATUS=$? # Allowable exit statuses for nvidia-smi, see: https://github.com/NVIDIA/gpu-operator/issues/285 if [ "$NVIDIA_SMI_STATUS" -eq 0 ] || [ "$NVIDIA_SMI_STATUS" -eq 14 ]; then echo "INFO: Ignoring allowed status ${NVIDIA_SMI_STATUS}" else echo "ERROR: nvidia-smi exited with unresolved status ${NVIDIA_SMI_STATUS}" exit ${NVIDIA_SMI_STATUS} fi set -e ) ) } install_nvidia_driver_amzn2() { ( set -x pre_install_nvidia_driver_amzn2 install_nvidia_driver_common post_install_nvidia_driver_common ) } install_nvidia_driver_ubuntu20() { ( set -x install_nvidia_driver_common post_install_nvidia_driver_common ) } echo "== Installing nvidia driver ${DRIVER_FN} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_driver_amzn2 ;; ubuntu20.04) install_nvidia_driver_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Install container toolkit based on distribution echo "== Installing nvidia container toolkit for ${DISTRIBUTION} ==" case "${DISTRIBUTION}" in amzn*) install_nvidia_docker2_amzn2 ;; ubuntu20.04) install_nvidia_docker2_ubuntu20 ;; *) echo "ERROR: Unknown distribution ${DISTRIBUTION}" exit 1 ;; esac # Fix https://github.com/NVIDIA/nvidia-docker/issues/1648 on runners with # more than one GPUs. 
This just needs to be run once. The command fails # on subsequent runs and complains that the mode is already on, but that's # ok sudo nvidia-persistenced || true # This should show persistence mode ON nvidia-smi # check if the container-toolkit is correctly installed and CUDA is available inside a container docker run --rm -t --gpus=all public.ecr.aws/docker/library/python:3.13 nvidia-smi 2025-12-04T09:49:50.7758704Z retry_wait_seconds: 10 2025-12-04T09:49:50.7758920Z polling_interval_seconds: 1 2025-12-04T09:49:50.7759115Z warning_on_retry: true 2025-12-04T09:49:50.7759308Z continue_on_error: false 2025-12-04T09:49:50.7759489Z env: 2025-12-04T09:49:50.7759629Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:49:50.7759819Z HAS_NVIDIA_GPU: true 2025-12-04T09:49:50.7760050Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:49:50.7760316Z DRIVER_VERSION: 580.82.07 2025-12-04T09:49:50.7760503Z ##[endgroup] 2025-12-04T09:49:50.8451766Z == Installing nvidia driver NVIDIA-Linux-x86_64-580.82.07.run == 2025-12-04T09:49:50.8452688Z + pre_install_nvidia_driver_amzn2 2025-12-04T09:49:50.8455651Z + sudo yum remove -y nvidia-driver-latest-dkms 2025-12-04T09:49:51.4692532Z No match for argument: nvidia-driver-latest-dkms 2025-12-04T09:49:51.4693347Z No packages marked for removal. 2025-12-04T09:49:51.4749208Z Dependencies resolved. 2025-12-04T09:49:51.4758315Z Nothing to do. 2025-12-04T09:49:51.4759119Z Complete! 
2025-12-04T09:49:51.5155790Z + install_nvidia_driver_common 2025-12-04T09:49:51.5158798Z + echo 'Before installing NVIDIA driver' 2025-12-04T09:49:51.5160257Z Before installing NVIDIA driver 2025-12-04T09:49:51.5162300Z + lspci 2025-12-04T09:49:51.5820557Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:49:51.5821043Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:49:51.5821583Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:49:51.5822082Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:49:51.5822527Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:49:51.5822946Z 01:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5823268Z 02:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5823585Z 03:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5823887Z 03:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5824181Z 03:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5824760Z 03:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5825091Z 03:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5825386Z 03:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5825696Z 03:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5825997Z 03:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5826309Z 03:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5826622Z 03:01.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5826919Z 03:01.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5827326Z 03:01.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5827636Z 03:01.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5827939Z 03:01.5 PCI bridge: Amazon.com, Inc. 
Device 0200 2025-12-04T09:49:51.5828228Z 03:01.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5828556Z 03:01.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5828979Z 03:02.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5829223Z 03:02.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5829460Z 03:02.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5829719Z 03:02.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5829957Z 03:02.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5830219Z 03:02.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5830460Z 03:02.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5830699Z 03:02.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5830936Z 03:03.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5831170Z 03:03.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5831404Z 03:03.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5831638Z 03:03.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5831869Z 03:03.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5832115Z 03:03.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5832351Z 03:03.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5832587Z 03:03.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5832989Z 24:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5833245Z 25:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5833485Z 26:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5833718Z 26:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5833958Z 26:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5834202Z 26:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5834444Z 26:00.4 PCI bridge: Amazon.com, Inc. 
Device 0200 2025-12-04T09:49:51.5834697Z 26:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5834937Z 26:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5835173Z 26:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5835419Z 26:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5835740Z 27:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:49:51.5836069Z 30:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5836304Z 31:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5836545Z 32:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5836859Z 33:00.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-12-04T09:49:51.5837174Z 34:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:51.5837455Z 35:00.0 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1) 2025-12-04T09:49:51.5837712Z + lsmod 2025-12-04T09:49:51.5873644Z Module Size Used by 2025-12-04T09:49:51.5873960Z nvidia_uvm 1925120 0 2025-12-04T09:49:51.5874213Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:49:51.5874496Z drm 602112 1 nvidia 2025-12-04T09:49:51.5874772Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:49:51.5875283Z backlight 24576 1 drm 2025-12-04T09:49:51.5875571Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:49:51.5875833Z xt_conntrack 16384 1 2025-12-04T09:49:51.5876070Z nft_chain_nat 16384 3 2025-12-04T09:49:51.5876299Z xt_MASQUERADE 20480 1 2025-12-04T09:49:51.5876568Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:49:51.5876878Z nf_conntrack_netlink 57344 0 2025-12-04T09:49:51.5877245Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:49:51.5877671Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:49:51.5877955Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:49:51.5878216Z xfrm_user 57344 1 2025-12-04T09:49:51.5878458Z xfrm_algo 16384 1 xfrm_user 
2025-12-04T09:49:51.5878661Z xt_addrtype 16384 2 2025-12-04T09:49:51.5878844Z nft_compat 20480 4 2025-12-04T09:49:51.5879885Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:49:51.5880190Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:49:51.5880467Z br_netfilter 36864 0 2025-12-04T09:49:51.5880662Z bridge 323584 1 br_netfilter 2025-12-04T09:49:51.5880876Z stp 16384 1 bridge 2025-12-04T09:49:51.5881076Z llc 16384 2 bridge,stp 2025-12-04T09:49:51.5881279Z overlay 167936 0 2025-12-04T09:49:51.5881454Z tls 139264 0 2025-12-04T09:49:51.5881624Z nls_ascii 16384 1 2025-12-04T09:49:51.5881801Z nls_cp437 20480 1 2025-12-04T09:49:51.5881976Z vfat 24576 1 2025-12-04T09:49:51.5882148Z fat 86016 1 vfat 2025-12-04T09:49:51.5882340Z sunrpc 700416 1 2025-12-04T09:49:51.5882517Z i8042 45056 0 2025-12-04T09:49:51.5882682Z ena 184320 0 2025-12-04T09:49:51.5882874Z serio 28672 3 i8042 2025-12-04T09:49:51.5883080Z button 24576 0 2025-12-04T09:49:51.5883263Z ghash_clmulni_intel 16384 0 2025-12-04T09:49:51.5883452Z sch_fq_codel 20480 9 2025-12-04T09:49:51.5883632Z fuse 184320 1 2025-12-04T09:49:51.5883803Z dm_mod 188416 0 2025-12-04T09:49:51.5883974Z configfs 57344 1 2025-12-04T09:49:51.5884152Z loop 36864 0 2025-12-04T09:49:51.5884330Z dmi_sysfs 20480 0 2025-12-04T09:49:51.5884504Z crc32_pclmul 16384 0 2025-12-04T09:49:51.5884685Z crc32c_intel 24576 0 2025-12-04T09:49:51.5884867Z efivarfs 24576 1 2025-12-04T09:49:51.5885047Z + modinfo nvidia 2025-12-04T09:49:51.5894837Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:49:51.5895294Z import_ns: DMA_BUF 2025-12-04T09:49:51.5895531Z alias: char-major-195-* 2025-12-04T09:49:51.5895774Z version: 580.82.07 2025-12-04T09:49:51.5896014Z supported: external 2025-12-04T09:49:51.5896246Z license: Dual MIT/GPL 2025-12-04T09:49:51.5896514Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:49:51.5896833Z firmware: nvidia/580.82.07/gsp_ga10x.bin 
2025-12-04T09:49:51.5897130Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:49:51.5897439Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:49:51.5897758Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:49:51.5898071Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:49:51.5898385Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:49:51.5898665Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:49:51.5898908Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:49:51.5899145Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:49:51.5899366Z depends: i2c-core,drm 2025-12-04T09:49:51.5899545Z retpoline: Y 2025-12-04T09:49:51.5899847Z name: nvidia 2025-12-04T09:49:51.5900124Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:49:51.5900472Z parm: NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:49:51.5900796Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:49:51.5901101Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:49:51.5901318Z parm: NVreg_RmLogonRC:int 2025-12-04T09:49:51.5901539Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:49:51.5901764Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:49:51.5901974Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:49:51.5902194Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:49:51.5902452Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:49:51.5902733Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:49:51.5902969Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:49:51.5903276Z parm: NVreg_EnableMSI:int 2025-12-04T09:49:51.5903499Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:49:51.5903762Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:49:51.5904052Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:49:51.5904325Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:49:51.5904619Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:49:51.5904924Z parm: 
NVreg_DynamicPowerManagement:int 2025-12-04T09:49:51.5905232Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:49:51.5905534Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:49:51.5905775Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:49:51.5906058Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:49:51.5906334Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:49:51.5906576Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:49:51.5906819Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:49:51.5907066Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:49:51.5907417Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:49:51.5907647Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:49:51.5907905Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:49:51.5908173Z parm: NVreg_RegisterPCIDriver:int 2025-12-04T09:49:51.5908427Z parm: NVreg_RegisterPlatformDeviceDriver:int 2025-12-04T09:49:51.5908689Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:49:51.5908931Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:49:51.5909175Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:49:51.5909437Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:49:51.5909687Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:49:51.5909933Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:49:51.5910180Z parm: NVreg_RmMsg:charp 2025-12-04T09:49:51.5910400Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:49:51.5910633Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:49:51.5910872Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:49:51.5911103Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:49:51.5911355Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:49:51.5911610Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:49:51.5911861Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:49:51.5912092Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:49:51.5912332Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:49:51.5912576Z parm: 
rm_firmware_active:charp 2025-12-04T09:49:51.5912803Z + HAS_NVIDIA_DRIVER=0 2025-12-04T09:49:51.5912977Z ++ command -v nvidia-smi 2025-12-04T09:49:51.5913169Z + '[' -x /usr/bin/nvidia-smi ']' 2025-12-04T09:49:51.5913360Z + set +e 2025-12-04T09:49:51.5913683Z ++ nvidia-smi --query-gpu=driver_version --format=csv,noheader --id=0 2025-12-04T09:49:53.2224886Z + INSTALLED_DRIVER_VERSION=580.82.07 2025-12-04T09:49:53.2225250Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:49:53.2225969Z + '[' 0 -ne 0 ']' 2025-12-04T09:49:53.2226181Z + '[' 580.82.07 '!=' 580.82.07 ']' 2025-12-04T09:49:53.2226441Z + HAS_NVIDIA_DRIVER=1 2025-12-04T09:49:53.2226846Z + echo 'NVIDIA driver (580.82.07) has already been installed. Skipping NVIDIA driver installation' 2025-12-04T09:49:53.2227395Z + set -e 2025-12-04T09:49:53.2227578Z + '[' 1 -eq 0 ']' 2025-12-04T09:49:53.2227941Z NVIDIA driver (580.82.07) has already been installed. Skipping NVIDIA driver installation 2025-12-04T09:49:53.2228390Z + post_install_nvidia_driver_common 2025-12-04T09:49:53.2231038Z + sudo modprobe nvidia 2025-12-04T09:49:53.3851935Z + echo 'After installing NVIDIA driver' 2025-12-04T09:49:53.3852372Z + lspci 2025-12-04T09:49:53.3852648Z After installing NVIDIA driver 2025-12-04T09:49:53.4024294Z 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] 2025-12-04T09:49:53.4025557Z 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II] 2025-12-04T09:49:53.4026445Z 00:01.3 Non-VGA unclassified device: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 08) 2025-12-04T09:49:53.4027418Z 00:03.0 VGA compatible controller: Amazon.com, Inc. Device 1111 2025-12-04T09:49:53.4028204Z 00:04.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe EBS Controller 2025-12-04T09:49:53.4028981Z 01:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4029552Z 02:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4030003Z 03:00.0 PCI bridge: Amazon.com, Inc. 
Device 0200 2025-12-04T09:49:53.4030432Z 03:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4030880Z 03:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4031307Z 03:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4031733Z 03:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4032147Z 03:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4032575Z 03:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4032995Z 03:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4033395Z 03:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4033793Z 03:01.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4034172Z 03:01.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4034616Z 03:01.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4034875Z 03:01.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4035122Z 03:01.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4035360Z 03:01.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4035598Z 03:01.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4035827Z 03:02.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4036080Z 03:02.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4036317Z 03:02.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4036564Z 03:02.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4036803Z 03:02.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4037042Z 03:02.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4037273Z 03:02.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4037508Z 03:02.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4037746Z 03:03.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4037982Z 03:03.1 PCI bridge: Amazon.com, Inc. 
Device 0200 2025-12-04T09:49:53.4038213Z 03:03.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4038448Z 03:03.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4038683Z 03:03.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4038921Z 03:03.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4039158Z 03:03.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4039407Z 03:03.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4039848Z 24:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4040119Z 25:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4040362Z 26:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4040599Z 26:00.1 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4040833Z 26:00.2 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4041075Z 26:00.3 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4041315Z 26:00.4 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4041547Z 26:00.5 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4041782Z 26:00.6 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4042169Z 26:00.7 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4042402Z 26:01.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4042717Z 27:00.0 Ethernet controller: Amazon.com, Inc. Elastic Network Adapter (ENA) 2025-12-04T09:49:53.4043039Z 30:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4043387Z 31:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4043623Z 32:00.0 PCI bridge: Amazon.com, Inc. Device 0200 2025-12-04T09:49:53.4043938Z 33:00.0 Non-Volatile memory controller: Amazon.com, Inc. NVMe SSD Controller 2025-12-04T09:49:53.4044277Z 34:00.0 PCI bridge: Amazon.com, Inc. 
Device 0200 2025-12-04T09:49:53.4044560Z 35:00.0 3D controller: NVIDIA Corporation AD104GL [L4] (rev a1) 2025-12-04T09:49:53.4044824Z + lsmod 2025-12-04T09:49:53.4062283Z Module Size Used by 2025-12-04T09:49:53.4062578Z nvidia_uvm 1925120 0 2025-12-04T09:49:53.4062832Z nvidia 14286848 1 nvidia_uvm 2025-12-04T09:49:53.4063110Z drm 602112 1 nvidia 2025-12-04T09:49:53.4063402Z drm_panel_orientation_quirks 32768 1 drm 2025-12-04T09:49:53.4063684Z backlight 24576 1 drm 2025-12-04T09:49:53.4063949Z i2c_core 110592 2 nvidia,drm 2025-12-04T09:49:53.4064217Z xt_conntrack 16384 1 2025-12-04T09:49:53.4064456Z nft_chain_nat 16384 3 2025-12-04T09:49:53.4064684Z xt_MASQUERADE 20480 1 2025-12-04T09:49:53.4064950Z nf_nat 57344 2 nft_chain_nat,xt_MASQUERADE 2025-12-04T09:49:53.4065250Z nf_conntrack_netlink 57344 0 2025-12-04T09:49:53.4065606Z nf_conntrack 184320 4 xt_conntrack,nf_nat,nf_conntrack_netlink,xt_MASQUERADE 2025-12-04T09:49:53.4066011Z nf_defrag_ipv6 24576 1 nf_conntrack 2025-12-04T09:49:53.4066297Z nf_defrag_ipv4 16384 1 nf_conntrack 2025-12-04T09:49:53.4066553Z xfrm_user 57344 1 2025-12-04T09:49:53.4066786Z xfrm_algo 16384 1 xfrm_user 2025-12-04T09:49:53.4067041Z xt_addrtype 16384 2 2025-12-04T09:49:53.4067354Z nft_compat 20480 4 2025-12-04T09:49:53.4067629Z nf_tables 311296 57 nft_compat,nft_chain_nat 2025-12-04T09:49:53.4068003Z nfnetlink 20480 4 nft_compat,nf_conntrack_netlink,nf_tables 2025-12-04T09:49:53.4068365Z br_netfilter 36864 0 2025-12-04T09:49:53.4068607Z bridge 323584 1 br_netfilter 2025-12-04T09:49:53.4068869Z stp 16384 1 bridge 2025-12-04T09:49:53.4069113Z llc 16384 2 bridge,stp 2025-12-04T09:49:53.4069310Z overlay 167936 0 2025-12-04T09:49:53.4069498Z tls 139264 0 2025-12-04T09:49:53.4069674Z nls_ascii 16384 1 2025-12-04T09:49:53.4069845Z nls_cp437 20480 1 2025-12-04T09:49:53.4070019Z vfat 24576 1 2025-12-04T09:49:53.4070198Z fat 86016 1 vfat 2025-12-04T09:49:53.4070385Z sunrpc 700416 1 2025-12-04T09:49:53.4070561Z i8042 45056 0 
2025-12-04T09:49:53.4070729Z ena 184320 0 2025-12-04T09:49:53.4070902Z serio 28672 3 i8042 2025-12-04T09:49:53.4071097Z button 24576 0 2025-12-04T09:49:53.4071275Z ghash_clmulni_intel 16384 0 2025-12-04T09:49:53.4071474Z sch_fq_codel 20480 9 2025-12-04T09:49:53.4071811Z fuse 184320 1 2025-12-04T09:49:53.4072020Z dm_mod 188416 0 2025-12-04T09:49:53.4072210Z configfs 57344 1 2025-12-04T09:49:53.4072383Z loop 36864 0 2025-12-04T09:49:53.4072560Z dmi_sysfs 20480 0 2025-12-04T09:49:53.4072851Z crc32_pclmul 16384 0 2025-12-04T09:49:53.4073059Z crc32c_intel 24576 0 2025-12-04T09:49:53.4073256Z efivarfs 24576 1 2025-12-04T09:49:53.4073438Z + modinfo nvidia 2025-12-04T09:49:53.4079848Z filename: /lib/modules/6.1.150-174.273.amzn2023.x86_64/kernel/drivers/video/nvidia.ko 2025-12-04T09:49:53.4080280Z import_ns: DMA_BUF 2025-12-04T09:49:53.4080472Z alias: char-major-195-* 2025-12-04T09:49:53.4080677Z version: 580.82.07 2025-12-04T09:49:53.4080856Z supported: external 2025-12-04T09:49:53.4081041Z license: Dual MIT/GPL 2025-12-04T09:49:53.4081255Z firmware: nvidia/580.82.07/gsp_tu10x.bin 2025-12-04T09:49:53.4081692Z firmware: nvidia/580.82.07/gsp_ga10x.bin 2025-12-04T09:49:53.4081934Z srcversion: BA7240A71DCF7DC6FE88C1D 2025-12-04T09:49:53.4082180Z alias: of:N*T*Cnvidia,tegra264-displayC* 2025-12-04T09:49:53.4082444Z alias: of:N*T*Cnvidia,tegra264-display 2025-12-04T09:49:53.4082718Z alias: of:N*T*Cnvidia,tegra234-displayC* 2025-12-04T09:49:53.4083082Z alias: of:N*T*Cnvidia,tegra234-display 2025-12-04T09:49:53.4083333Z alias: pci:v000010DEd*sv*sd*bc06sc80i00* 2025-12-04T09:49:53.4083564Z alias: pci:v000010DEd*sv*sd*bc03sc02i00* 2025-12-04T09:49:53.4083794Z alias: pci:v000010DEd*sv*sd*bc03sc00i00* 2025-12-04T09:49:53.4084016Z depends: i2c-core,drm 2025-12-04T09:49:53.4084193Z retpoline: Y 2025-12-04T09:49:53.4084464Z name: nvidia 2025-12-04T09:49:53.4084889Z vermagic: 6.1.150-174.273.amzn2023.x86_64 SMP preempt mod_unload modversions 2025-12-04T09:49:53.4085319Z parm: 
NvSwitchRegDwords:NvSwitch regkey (charp) 2025-12-04T09:49:53.4085660Z parm: NvSwitchBlacklist:NvSwitchBlacklist=uuid[,uuid...] (charp) 2025-12-04T09:49:53.4085974Z parm: NVreg_ResmanDebugLevel:int 2025-12-04T09:49:53.4086202Z parm: NVreg_RmLogonRC:int 2025-12-04T09:49:53.4086417Z parm: NVreg_ModifyDeviceFiles:int 2025-12-04T09:49:53.4086649Z parm: NVreg_DeviceFileUID:int 2025-12-04T09:49:53.4086869Z parm: NVreg_DeviceFileGID:int 2025-12-04T09:49:53.4087082Z parm: NVreg_DeviceFileMode:int 2025-12-04T09:49:53.4087358Z parm: NVreg_InitializeSystemMemoryAllocations:int 2025-12-04T09:49:53.4087642Z parm: NVreg_UsePageAttributeTable:int 2025-12-04T09:49:53.4087882Z parm: NVreg_EnablePCIeGen3:int 2025-12-04T09:49:53.4088096Z parm: NVreg_EnableMSI:int 2025-12-04T09:49:53.4088318Z parm: NVreg_EnableStreamMemOPs:int 2025-12-04T09:49:53.4088594Z parm: NVreg_RestrictProfilingToAdminUsers:int 2025-12-04T09:49:53.4088888Z parm: NVreg_PreserveVideoMemoryAllocations:int 2025-12-04T09:49:53.4089163Z parm: NVreg_EnableS0ixPowerManagement:int 2025-12-04T09:49:53.4089466Z parm: NVreg_S0ixPowerManagementVideoMemoryThreshold:int 2025-12-04T09:49:53.4089757Z parm: NVreg_DynamicPowerManagement:int 2025-12-04T09:49:53.4090076Z parm: NVreg_DynamicPowerManagementVideoMemoryThreshold:int 2025-12-04T09:49:53.4090377Z parm: NVreg_EnableGpuFirmware:int 2025-12-04T09:49:53.4090621Z parm: NVreg_EnableGpuFirmwareLogs:int 2025-12-04T09:49:53.4090883Z parm: NVreg_OpenRmEnableUnsupportedGpus:int 2025-12-04T09:49:53.4091158Z parm: NVreg_EnableUserNUMAManagement:int 2025-12-04T09:49:53.4091403Z parm: NVreg_MemoryPoolSize:int 2025-12-04T09:49:53.4091629Z parm: NVreg_KMallocHeapMaxSize:int 2025-12-04T09:49:53.4091871Z parm: NVreg_VMallocHeapMaxSize:int 2025-12-04T09:49:53.4092105Z parm: NVreg_IgnoreMMIOCheck:int 2025-12-04T09:49:53.4092455Z parm: NVreg_NvLinkDisable:int 2025-12-04T09:49:53.4092727Z parm: NVreg_EnablePCIERelaxedOrderingMode:int 2025-12-04T09:49:53.4093005Z parm: 
NVreg_RegisterPCIDriver:int 2025-12-04T09:49:53.4093272Z parm: NVreg_RegisterPlatformDeviceDriver:int 2025-12-04T09:49:53.4093531Z parm: NVreg_EnableResizableBar:int 2025-12-04T09:49:53.4093785Z parm: NVreg_EnableDbgBreakpoint:int 2025-12-04T09:49:53.4094041Z parm: NVreg_EnableNonblockingOpen:int 2025-12-04T09:49:53.4094294Z parm: NVreg_CoherentGPUMemoryMode:charp 2025-12-04T09:49:53.4094541Z parm: NVreg_RegistryDwords:charp 2025-12-04T09:49:53.4094786Z parm: NVreg_RegistryDwordsPerDevice:charp 2025-12-04T09:49:53.4095021Z parm: NVreg_RmMsg:charp 2025-12-04T09:49:53.4095228Z parm: NVreg_GpuBlacklist:charp 2025-12-04T09:49:53.4095462Z parm: NVreg_TemporaryFilePath:charp 2025-12-04T09:49:53.4095692Z parm: NVreg_ExcludedGpus:charp 2025-12-04T09:49:53.4095993Z parm: NVreg_DmaRemapPeerMmio:int 2025-12-04T09:49:53.4096226Z parm: NVreg_RmNvlinkBandwidth:charp 2025-12-04T09:49:53.4096479Z parm: NVreg_RmNvlinkBandwidthLinkCount:int 2025-12-04T09:49:53.4096722Z parm: NVreg_ImexChannelCount:int 2025-12-04T09:49:53.4096953Z parm: NVreg_CreateImexChannel0:int 2025-12-04T09:49:53.4097198Z parm: NVreg_GrdmaPciTopoCheckOverride:int 2025-12-04T09:49:53.4097434Z parm: rm_firmware_active:charp 2025-12-04T09:49:53.4097638Z + set +e 2025-12-04T09:49:53.4097784Z + nvidia-smi 2025-12-04T09:49:54.8445055Z Thu Dec 4 09:49:54 2025 2025-12-04T09:49:54.8445505Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:49:54.8446006Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:49:54.8446495Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:49:54.8447065Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:49:54.8447951Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:49:54.8448463Z | | | MIG M. 
| 2025-12-04T09:49:54.8448765Z |=========================================+========================+======================| 2025-12-04T09:49:54.8514203Z | 0 NVIDIA L4 Off | 00000000:35:00.0 Off | 0 | 2025-12-04T09:49:54.8514931Z | N/A 35C P0 28W / 72W | 0MiB / 23034MiB | 2% Default | 2025-12-04T09:49:54.8515529Z | | | N/A | 2025-12-04T09:49:54.8516162Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:49:54.8516650Z 2025-12-04T09:49:54.8516932Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:49:54.8517576Z | Processes: | 2025-12-04T09:49:54.8518011Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:49:54.8518405Z | ID ID Usage | 2025-12-04T09:49:54.8518709Z |=========================================================================================| 2025-12-04T09:49:54.8519300Z | No running processes found | 2025-12-04T09:49:54.8519825Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:49:55.1773578Z + nvidia-smi --query-gpu=gpu_name --format=csv,noheader --id=0 2025-12-04T09:49:56.6101436Z NVIDIA L4 2025-12-04T09:49:56.7939308Z + NVIDIA_SMI_STATUS=0 2025-12-04T09:49:56.7939950Z + '[' 0 -eq 0 ']' 2025-12-04T09:49:56.7940197Z + echo 'INFO: Ignoring allowed status 0' 2025-12-04T09:49:56.7940466Z + set -e 2025-12-04T09:49:56.7940660Z INFO: Ignoring allowed status 0 2025-12-04T09:49:56.7948513Z == Installing nvidia container toolkit for amzn2023 == 2025-12-04T09:49:56.7952200Z + sudo yum install -y yum-utils 2025-12-04T09:49:57.2204214Z Last metadata expiration check: 0:08:06 ago on Thu Dec 4 09:41:51 2025. 2025-12-04T09:49:57.2431385Z Package dnf-utils-4.3.0-13.amzn2023.0.5.noarch is already installed. 2025-12-04T09:49:57.2846534Z Dependencies resolved. 2025-12-04T09:49:57.3089299Z Nothing to do. 2025-12-04T09:49:57.3089819Z Complete! 
2025-12-04T09:49:57.3760617Z + [[ amzn2023 == \a\m\z\n\2\0\2\3 ]] 2025-12-04T09:49:57.3761235Z + YUM_REPO_URL=https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:49:57.3762115Z + sudo yum-config-manager --add-repo https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:49:57.7057640Z Adding repo from: https://nvidia.github.io/libnvidia-container/stable/rpm/nvidia-container-toolkit.repo 2025-12-04T09:49:57.7540087Z + sudo yum install -y nvidia-container-toolkit-1.17.8 libnvidia-container-tools-1.17.8 libnvidia-container1-1.17.8 nvidia-container-toolkit-base-1.17.8 2025-12-04T09:49:58.2526523Z nvidia-container-toolkit 23 kB/s | 833 B 00:00 2025-12-04T09:49:58.3201125Z Dependencies resolved. 2025-12-04T09:49:58.3429398Z ================================================================================ 2025-12-04T09:49:58.3429846Z Package Arch Version Repository Size 2025-12-04T09:49:58.3430214Z ================================================================================ 2025-12-04T09:49:58.3430495Z Downgrading: 2025-12-04T09:49:58.3430847Z libnvidia-container-tools x86_64 1.17.8-1 nvidia-container-toolkit 40 k 2025-12-04T09:49:58.3431449Z libnvidia-container1 x86_64 1.17.8-1 nvidia-container-toolkit 1.0 M 2025-12-04T09:49:58.3431953Z nvidia-container-toolkit x86_64 1.17.8-1 nvidia-container-toolkit 1.2 M 2025-12-04T09:49:58.3432485Z nvidia-container-toolkit-base x86_64 1.17.8-1 nvidia-container-toolkit 5.8 M 2025-12-04T09:49:58.3432814Z 2025-12-04T09:49:58.3432898Z Transaction Summary 2025-12-04T09:49:58.3433127Z ================================================================================ 2025-12-04T09:49:58.3433421Z Downgrade 4 Packages 2025-12-04T09:49:58.3433576Z 2025-12-04T09:49:58.3433677Z Total download size: 8.0 M 2025-12-04T09:49:58.3434824Z Downloading Packages: 2025-12-04T09:49:58.3676248Z (1/4): libnvidia-container-tools-1.17.8-1.x86_6 1.7 MB/s | 40 kB 00:00 
2025-12-04T09:49:58.4164057Z (2/4): nvidia-container-toolkit-1.17.8-1.x86_64 17 MB/s | 1.2 MB 00:00 2025-12-04T09:49:58.4429629Z (3/4): libnvidia-container1-1.17.8-1.x86_64.rpm 10 MB/s | 1.0 MB 00:00 2025-12-04T09:49:58.5373698Z (4/4): nvidia-container-toolkit-base-1.17.8-1.x 34 MB/s | 5.8 MB 00:00 2025-12-04T09:49:58.5384735Z -------------------------------------------------------------------------------- 2025-12-04T09:49:58.5387708Z Total 41 MB/s | 8.0 MB 00:00 2025-12-04T09:49:58.5390191Z Running transaction check 2025-12-04T09:49:58.5507595Z Transaction check succeeded. 2025-12-04T09:49:58.5508095Z Running transaction test 2025-12-04T09:49:58.5940102Z Transaction test succeeded. 2025-12-04T09:49:58.5942986Z Running transaction 2025-12-04T09:49:59.1477504Z Preparing : 1/1 2025-12-04T09:49:59.2739607Z Downgrading : nvidia-container-toolkit-base-1.17.8-1.x86_64 1/8 2025-12-04T09:49:59.3182195Z Downgrading : libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:49:59.3821627Z Running scriptlet: libnvidia-container1-1.17.8-1.x86_64 2/8 2025-12-04T09:49:59.4784257Z Downgrading : libnvidia-container-tools-1.17.8-1.x86_64 3/8 2025-12-04T09:49:59.5088404Z Downgrading : nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:49:59.5575855Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 4/8 2025-12-04T09:49:59.5639046Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:49:59.5641346Z Cleanup : nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:49:59.5996897Z Running scriptlet: nvidia-container-toolkit-1.18.1-1.x86_64 5/8 2025-12-04T09:49:59.6049516Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:49:59.6050839Z Cleanup : libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:49:59.6276677Z Running scriptlet: libnvidia-container-tools-1.18.1-1.x86_64 6/8 2025-12-04T09:49:59.6336572Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:49:59.6337680Z Cleanup : 
libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:49:59.6608265Z Running scriptlet: libnvidia-container1-1.18.1-1.x86_64 7/8 2025-12-04T09:49:59.6669849Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:49:59.6670944Z Cleanup : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:49:59.6943475Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:49:59.7393806Z Running scriptlet: nvidia-container-toolkit-1.17.8-1.x86_64 8/8 2025-12-04T09:50:45.4300161Z Running scriptlet: nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8 2025-12-04T09:50:45.4302452Z Verifying : libnvidia-container-tools-1.17.8-1.x86_64 1/8 2025-12-04T09:50:45.4303052Z Verifying : libnvidia-container-tools-1.18.1-1.x86_64 2/8 2025-12-04T09:50:45.4303589Z Verifying : libnvidia-container1-1.17.8-1.x86_64 3/8 2025-12-04T09:50:45.4304069Z Verifying : libnvidia-container1-1.18.1-1.x86_64 4/8 2025-12-04T09:50:45.4304542Z Verifying : nvidia-container-toolkit-1.17.8-1.x86_64 5/8 2025-12-04T09:50:45.4305008Z Verifying : nvidia-container-toolkit-1.18.1-1.x86_64 6/8 2025-12-04T09:50:45.4305494Z Verifying : nvidia-container-toolkit-base-1.17.8-1.x86_64 7/8 2025-12-04T09:50:45.5683013Z Verifying : nvidia-container-toolkit-base-1.18.1-1.x86_64 8/8================================================================================ 2025-12-04T09:50:45.5683591Z WARNING: 2025-12-04T09:50:45.5683831Z A newer release of "Amazon Linux" is available. 
2025-12-04T09:50:45.5684051Z 2025-12-04T09:50:45.5684140Z Available Versions: 2025-12-04T09:50:45.5684289Z 2025-12-04T09:50:45.5684406Z Version 2023.9.20250929: 2025-12-04T09:50:45.5684709Z Run the following command to upgrade to 2023.9.20250929: 2025-12-04T09:50:45.5684947Z 2025-12-04T09:50:45.5685086Z dnf upgrade --releasever=2023.9.20250929 2025-12-04T09:50:45.5685287Z 2025-12-04T09:50:45.5685366Z Release notes: 2025-12-04T09:50:45.5685768Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20250929.html 2025-12-04T09:50:45.5686119Z 2025-12-04T09:50:45.5686206Z Version 2023.9.20251014: 2025-12-04T09:50:45.5686495Z Run the following command to upgrade to 2023.9.20251014: 2025-12-04T09:50:45.5686727Z 2025-12-04T09:50:45.5686834Z dnf upgrade --releasever=2023.9.20251014 2025-12-04T09:50:45.5687032Z 2025-12-04T09:50:45.5687106Z Release notes: 2025-12-04T09:50:45.5687471Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251014.html 2025-12-04T09:50:45.5687809Z 2025-12-04T09:50:45.5687887Z Version 2023.9.20251020: 2025-12-04T09:50:45.5688462Z Run the following command to upgrade to 2023.9.20251020: 2025-12-04T09:50:45.5688726Z 2025-12-04T09:50:45.5688835Z dnf upgrade --releasever=2023.9.20251020 2025-12-04T09:50:45.5689028Z 2025-12-04T09:50:45.5689107Z Release notes: 2025-12-04T09:50:45.5689466Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251020.html 2025-12-04T09:50:45.5689810Z 2025-12-04T09:50:45.5689889Z Version 2023.9.20251027: 2025-12-04T09:50:45.5690163Z Run the following command to upgrade to 2023.9.20251027: 2025-12-04T09:50:45.5690384Z 2025-12-04T09:50:45.5690491Z dnf upgrade --releasever=2023.9.20251027 2025-12-04T09:50:45.5690675Z 2025-12-04T09:50:45.5690749Z Release notes: 2025-12-04T09:50:45.5691098Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251027.html 2025-12-04T09:50:45.5691425Z 2025-12-04T09:50:45.5691508Z Version 2023.9.20251105: 
2025-12-04T09:50:45.5691767Z Run the following command to upgrade to 2023.9.20251105: 2025-12-04T09:50:45.5692204Z 2025-12-04T09:50:45.5692310Z dnf upgrade --releasever=2023.9.20251105 2025-12-04T09:50:45.5692504Z 2025-12-04T09:50:45.5692590Z Release notes: 2025-12-04T09:50:45.5692941Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251105.html 2025-12-04T09:50:45.5693268Z 2025-12-04T09:50:45.5693344Z Version 2023.9.20251110: 2025-12-04T09:50:45.5693611Z Run the following command to upgrade to 2023.9.20251110: 2025-12-04T09:50:45.5693830Z 2025-12-04T09:50:45.5693936Z dnf upgrade --releasever=2023.9.20251110 2025-12-04T09:50:45.5694120Z 2025-12-04T09:50:45.5694197Z Release notes: 2025-12-04T09:50:45.5694552Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251110.html 2025-12-04T09:50:45.5694894Z 2025-12-04T09:50:45.5694978Z Version 2023.9.20251117: 2025-12-04T09:50:45.5695246Z Run the following command to upgrade to 2023.9.20251117: 2025-12-04T09:50:45.5695469Z 2025-12-04T09:50:45.5695572Z dnf upgrade --releasever=2023.9.20251117 2025-12-04T09:50:45.5695773Z 2025-12-04T09:50:45.5695847Z Release notes: 2025-12-04T09:50:45.5696139Z https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.9.20251117.html 2025-12-04T09:50:45.5696402Z 2025-12-04T09:50:45.5696490Z ================================================================================ 2025-12-04T09:50:45.6152446Z 2025-12-04T09:50:45.6152576Z 2025-12-04T09:50:45.6152655Z Downgraded: 2025-12-04T09:50:45.6153002Z libnvidia-container-tools-1.17.8-1.x86_64 2025-12-04T09:50:45.6153520Z libnvidia-container1-1.17.8-1.x86_64 2025-12-04T09:50:45.6154025Z nvidia-container-toolkit-1.17.8-1.x86_64 2025-12-04T09:50:45.6154550Z nvidia-container-toolkit-base-1.17.8-1.x86_64 2025-12-04T09:50:45.6154864Z 2025-12-04T09:50:45.6154945Z Complete! 
2025-12-04T09:50:45.6623972Z + sudo systemctl restart docker 2025-12-04T09:50:51.4412546Z Thu Dec 4 09:50:51 2025 2025-12-04T09:50:51.4413152Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:50:51.4413897Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:50:51.4414566Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:50:51.4415266Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:50:51.4416000Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:50:51.4416611Z | | | MIG M. | 2025-12-04T09:50:51.4417123Z |=========================================+========================+======================| 2025-12-04T09:50:51.4489819Z | 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 | 2025-12-04T09:50:51.4490540Z | N/A 35C P0 29W / 72W | 0MiB / 23034MiB | 4% Default | 2025-12-04T09:50:51.4490903Z | | | N/A | 2025-12-04T09:50:51.4491254Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:50:51.4491529Z 2025-12-04T09:50:51.4491683Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:50:51.4492070Z | Processes: | 2025-12-04T09:50:51.4492501Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:50:51.4492879Z | ID ID Usage | 2025-12-04T09:50:51.4493188Z |=========================================================================================| 2025-12-04T09:50:51.4494593Z | No running processes found | 2025-12-04T09:50:51.6105267Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:50:51.6105765Z Unable to find image 'public.ecr.aws/docker/library/python:3.13' locally 2025-12-04T09:50:51.8191364Z 3.13: Pulling from docker/library/python 2025-12-04T09:50:51.9025391Z 53c88f1dfeb7: Pulling fs layer 
2025-12-04T09:50:51.9025761Z eae668646f44: Pulling fs layer 2025-12-04T09:50:51.9026023Z ff2e6e687b6c: Pulling fs layer 2025-12-04T09:50:51.9026289Z 7c40a3faff76: Pulling fs layer 2025-12-04T09:50:51.9026556Z 967a3b1c8fef: Pulling fs layer 2025-12-04T09:50:51.9026849Z a64e1a44f22a: Pulling fs layer 2025-12-04T09:50:51.9027407Z 52655f8a5bcc: Pulling fs layer 2025-12-04T09:50:51.9027841Z 7c40a3faff76: Waiting 2025-12-04T09:50:51.9028136Z 967a3b1c8fef: Waiting 2025-12-04T09:50:51.9028371Z a64e1a44f22a: Waiting 2025-12-04T09:50:51.9028645Z 52655f8a5bcc: Waiting 2025-12-04T09:50:52.0207788Z eae668646f44: Verifying Checksum 2025-12-04T09:50:52.0208089Z eae668646f44: Download complete 2025-12-04T09:50:52.0599150Z 53c88f1dfeb7: Verifying Checksum 2025-12-04T09:50:52.0599688Z 53c88f1dfeb7: Download complete 2025-12-04T09:50:52.1269126Z 967a3b1c8fef: Verifying Checksum 2025-12-04T09:50:52.1269584Z 967a3b1c8fef: Download complete 2025-12-04T09:50:52.1925700Z ff2e6e687b6c: Verifying Checksum 2025-12-04T09:50:52.1926004Z ff2e6e687b6c: Download complete 2025-12-04T09:50:52.2426873Z 52655f8a5bcc: Download complete 2025-12-04T09:50:52.2474159Z a64e1a44f22a: Verifying Checksum 2025-12-04T09:50:52.2474594Z a64e1a44f22a: Download complete 2025-12-04T09:50:52.5801149Z 7c40a3faff76: Verifying Checksum 2025-12-04T09:50:52.5801675Z 7c40a3faff76: Download complete 2025-12-04T09:50:53.3839654Z 53c88f1dfeb7: Pull complete 2025-12-04T09:50:53.9285399Z eae668646f44: Pull complete 2025-12-04T09:50:55.7328664Z ff2e6e687b6c: Pull complete 2025-12-04T09:51:00.9764205Z 7c40a3faff76: Pull complete 2025-12-04T09:51:01.3225516Z 967a3b1c8fef: Pull complete 2025-12-04T09:51:01.9751986Z a64e1a44f22a: Pull complete 2025-12-04T09:51:01.9982912Z 52655f8a5bcc: Pull complete 2025-12-04T09:51:02.0124033Z Digest: sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 2025-12-04T09:51:02.0165161Z Status: Downloaded newer image for public.ecr.aws/docker/library/python:3.13 
2025-12-04T09:51:09.3741684Z Thu Dec 4 09:51:09 2025 2025-12-04T09:51:09.3742332Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:51:09.3742871Z | NVIDIA-SMI 580.82.07 Driver Version: 580.82.07 CUDA Version: 13.0 | 2025-12-04T09:51:09.3743263Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:51:09.3743634Z | GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC | 2025-12-04T09:51:09.3744328Z | Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. | 2025-12-04T09:51:09.3744706Z | | | MIG M. | 2025-12-04T09:51:09.3744951Z |=========================================+========================+======================| 2025-12-04T09:51:09.3863649Z | 0 NVIDIA L4 On | 00000000:35:00.0 Off | 0 | 2025-12-04T09:51:09.3864359Z | N/A 34C P8 12W / 72W | 0MiB / 23034MiB | 0% Default | 2025-12-04T09:51:09.3864885Z | | | N/A | 2025-12-04T09:51:09.3865259Z +-----------------------------------------+------------------------+----------------------+ 2025-12-04T09:51:09.3866982Z 2025-12-04T09:51:09.3867379Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:51:09.3868073Z | Processes: | 2025-12-04T09:51:09.3868855Z | GPU GI CI PID Type Process name GPU Memory | 2025-12-04T09:51:09.3869249Z | ID ID Usage | 2025-12-04T09:51:09.3869565Z |=========================================================================================| 2025-12-04T09:51:09.3872383Z | No running processes found | 2025-12-04T09:51:09.3873147Z +-----------------------------------------------------------------------------------------+ 2025-12-04T09:51:10.8971902Z Command completed after 1 attempt(s). 
2025-12-04T09:51:10.9055971Z Prepare all required actions 2025-12-04T09:51:10.9081286Z ##[group]Run ./.github/actions/get-workflow-job-id 2025-12-04T09:51:10.9081539Z with: 2025-12-04T09:51:10.9082238Z github-token: *** 2025-12-04T09:51:10.9082423Z env: 2025-12-04T09:51:10.9082585Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:10.9082792Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:10.9083025Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:10.9083288Z ##[endgroup] 2025-12-04T09:51:10.9096625Z ##[group]Run set -eux 2025-12-04T09:51:10.9096816Z set -eux 2025-12-04T09:51:10.9097134Z python3 .github/scripts/get_workflow_job_id.py "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T09:51:10.9109777Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:10.9110051Z env: 2025-12-04T09:51:10.9110205Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:10.9110401Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:10.9110663Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:10.9111050Z GITHUB_TOKEN: *** 2025-12-04T09:51:10.9111214Z ##[endgroup] 2025-12-04T09:51:10.9144393Z + python3 .github/scripts/get_workflow_job_id.py 19922826259 i-07df7d64debf86ede 2025-12-04T09:51:13.2848679Z Setting output job-id=57120265563 2025-12-04T09:51:13.2849470Z Setting output job-name=linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:51:13.2963223Z ##[group]Run python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:51:13.2963960Z python3 -m pip install psutil==5.9.8 dataclasses_json==0.6.7 nvidia-ml-py==11.525.84 2025-12-04T09:51:13.2964656Z python3 -m tools.stats.monitor --log-interval "$MONITOR_LOG_INTERVAL" --data-collect-interval "$MONITOR_DATA_COLLECT_INTERVAL" > usage_log.txt 2>&1 & 2025-12-04T09:51:13.2965262Z echo "monitor-script-pid=${!}" >> "${GITHUB_OUTPUT}" 2025-12-04T09:51:13.2973226Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:13.2973504Z env: 2025-12-04T09:51:13.2973660Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:13.2973845Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:13.2974065Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:13.2974313Z JOB_ID: 57120265563 2025-12-04T09:51:13.2974864Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:51:13.2975384Z WORKFLOW_NAME: periodic 2025-12-04T09:51:13.2975573Z WORKFLOW_RUN_ID: 19922826259 2025-12-04T09:51:13.2975764Z MONITOR_LOG_INTERVAL: 5 2025-12-04T09:51:13.2975943Z MONITOR_DATA_COLLECT_INTERVAL: 1 2025-12-04T09:51:13.2976148Z ##[endgroup] 2025-12-04T09:51:13.5708556Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:51:13.9011449Z Collecting psutil==5.9.8 2025-12-04T09:51:13.9173898Z Downloading psutil-5.9.8-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (288 kB) 2025-12-04T09:51:13.9865410Z Collecting dataclasses_json==0.6.7 2025-12-04T09:51:13.9903963Z Downloading dataclasses_json-0.6.7-py3-none-any.whl (28 kB) 2025-12-04T09:51:14.0182189Z Collecting nvidia-ml-py==11.525.84 2025-12-04T09:51:14.0218923Z Downloading nvidia_ml_py-11.525.84-py3-none-any.whl (34 kB) 2025-12-04T09:51:14.0525622Z Collecting typing-inspect<1,>=0.4.0 2025-12-04T09:51:14.0565825Z Downloading typing_inspect-0.9.0-py3-none-any.whl (8.8 kB) 2025-12-04T09:51:14.1525208Z Collecting marshmallow<4.0.0,>=3.18.0 2025-12-04T09:51:14.1558839Z Downloading marshmallow-3.26.1-py3-none-any.whl (50 kB) 2025-12-04T09:51:14.2071532Z Collecting packaging>=17.0 2025-12-04T09:51:14.2106709Z Downloading packaging-25.0-py3-none-any.whl (66 kB) 2025-12-04T09:51:14.2327109Z Collecting mypy-extensions>=0.3.0 2025-12-04T09:51:14.2360546Z Downloading mypy_extensions-1.1.0-py3-none-any.whl (5.0 kB) 2025-12-04T09:51:14.2798655Z Collecting typing-extensions>=3.7.4 2025-12-04T09:51:14.2836070Z Downloading typing_extensions-4.15.0-py3-none-any.whl (44 kB) 2025-12-04T09:51:14.3721575Z Installing collected packages: typing-extensions, packaging, mypy-extensions, typing-inspect, marshmallow, psutil, nvidia-ml-py, dataclasses-json 2025-12-04T09:51:14.6223085Z Successfully installed dataclasses-json-0.6.7 marshmallow-3.26.1 mypy-extensions-1.1.0 nvidia-ml-py-11.525.84 packaging-25.0 psutil-5.9.8 
typing-extensions-4.15.0 typing-inspect-0.9.0 2025-12-04T09:51:14.7814058Z Prepare all required actions 2025-12-04T09:51:14.7814420Z Getting action download info 2025-12-04T09:51:15.0522900Z Download action repository 'seemethere/download-artifact-s3@v4' (SHA:1da556a7aa0a088e3153970611f6c432d58e80e6) 2025-12-04T09:51:15.3033090Z Download action repository 'actions/download-artifact@v4' (SHA:d3f86a106a0bac45b974a628896c90dbdf5c8093) 2025-12-04T09:51:15.6982166Z ##[group]Run ./.github/actions/download-build-artifacts 2025-12-04T09:51:15.6982455Z with: 2025-12-04T09:51:15.6982658Z name: linux-jammy-cuda12.8-py3.10-gcc11-debug 2025-12-04T09:51:15.6982903Z s3-bucket: gha-artifacts 2025-12-04T09:51:15.6983090Z env: 2025-12-04T09:51:15.6983239Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:15.6983418Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:15.6983669Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:15.6983925Z ##[endgroup] 2025-12-04T09:51:15.7009178Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:51:15.7009466Z with: 2025-12-04T09:51:15.7009651Z name: linux-jammy-cuda12.8-py3.10-gcc11-debug 2025-12-04T09:51:15.7009902Z s3-bucket: gha-artifacts 2025-12-04T09:51:15.7010098Z region: us-east-1 2025-12-04T09:51:15.7010253Z env: 2025-12-04T09:51:15.7010409Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:15.7010604Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:15.7010836Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:15.7011098Z ##[endgroup] 2025-12-04T09:51:16.1118000Z (node:60843) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:51:16.1118460Z 2025-12-04T09:51:16.1118649Z Please migrate your code to use AWS SDK for JavaScript (v3). 
2025-12-04T09:51:16.1186413Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:51:16.1186981Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:51:16.4135649Z Found 1 objects with prefix pytorch/pytorch/19922826259/linux-jammy-cuda12.8-py3.10-gcc11-debug/ 2025-12-04T09:51:16.4136369Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:51:24.5287265Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/artifacts.zip 2025-12-04T09:51:24.5293202Z Artifact download has finished successfully 2025-12-04T09:51:24.5559398Z ##[group]Run unzip -o artifacts.zip 2025-12-04T09:51:24.5559669Z unzip -o artifacts.zip 2025-12-04T09:51:24.5568374Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:24.5568748Z env: 2025-12-04T09:51:24.5568910Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:24.5569110Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:24.5569337Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:24.5569587Z ##[endgroup] 2025-12-04T09:51:24.5649998Z Archive: artifacts.zip 2025-12-04T09:51:24.5651153Z creating: dist/ 2025-12-04T09:51:26.3876416Z inflating: dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:51:26.3992792Z inflating: dist/.ninja_log 2025-12-04T09:51:26.3993527Z creating: build/custom_test_artifacts/ 2025-12-04T09:51:26.3993942Z creating: build/custom_test_artifacts/custom-op-build/ 2025-12-04T09:51:26.3994406Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/ 2025-12-04T09:51:26.3994939Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:51:26.4002268Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:51:26.4002754Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/ 2025-12-04T09:51:26.4003447Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:51:26.4003965Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:51:26.4004900Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:51:26.4007305Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:51:26.4008567Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:51:26.4009611Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:51:26.4010269Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:51:26.4010891Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:51:26.4013397Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:51:26.4014698Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:51:26.4015867Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:51:26.4017619Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:51:26.4019503Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:51:26.4020094Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:51:26.4020607Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:51:26.4072539Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:51:26.4125107Z 
inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:51:26.4126083Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:51:26.4182125Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:51:26.4183059Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:51:26.4184165Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:51:26.4185104Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:51:26.4186009Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:51:26.4186884Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:51:26.4188190Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:51:26.4189131Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:51:26.4190390Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:51:26.4191280Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:51:26.4192102Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:51:26.4193034Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:51:26.4194091Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:51:26.4195334Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:51:26.4197830Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:51:26.4262225Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:51:26.4263148Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:51:26.4327955Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:51:26.4328670Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:51:26.4329225Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:51:26.4329784Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/cmake.check_cache 2025-12-04T09:51:26.4330388Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/ 2025-12-04T09:51:26.4331042Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.ts 2025-12-04T09:51:26.4331787Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/compiler_depend.make 2025-12-04T09:51:26.4332478Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/depend.make 2025-12-04T09:51:26.4333127Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/link.txt 2025-12-04T09:51:26.4333988Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/cmake_clean.cmake 2025-12-04T09:51:26.4335759Z inflating: 
build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/build.make 2025-12-04T09:51:26.4336523Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/DependInfo.cmake 2025-12-04T09:51:26.4337288Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/flags.make 2025-12-04T09:51:26.4338342Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/progress.make 2025-12-04T09:51:26.4357303Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o.d 2025-12-04T09:51:26.4532157Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/custom_ops.dir/op.cpp.o 2025-12-04T09:51:26.4532784Z creating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/ 2025-12-04T09:51:26.4533466Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.ts 2025-12-04T09:51:26.4534264Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/compiler_depend.make 2025-12-04T09:51:26.4535134Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/depend.make 2025-12-04T09:51:26.4536072Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/link.txt 2025-12-04T09:51:26.4536802Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/cmake_clean.cmake 2025-12-04T09:51:26.4537910Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/build.make 2025-12-04T09:51:26.4538645Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/DependInfo.cmake 2025-12-04T09:51:26.4539440Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/flags.make 2025-12-04T09:51:26.4540423Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/progress.make 
2025-12-04T09:51:26.4559571Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o.d 2025-12-04T09:51:26.4630992Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/test_custom_ops.dir/test_custom_ops.cpp.o 2025-12-04T09:51:26.4632005Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:51:26.4632725Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:51:26.4633477Z extracting: build/custom_test_artifacts/custom-op-build/CMakeFiles/progress.marks 2025-12-04T09:51:26.4634545Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile2 2025-12-04T09:51:26.4636307Z inflating: build/custom_test_artifacts/custom-op-build/CMakeFiles/Makefile.cmake 2025-12-04T09:51:26.4637361Z inflating: build/custom_test_artifacts/custom-op-build/detect_cuda_version.cc 2025-12-04T09:51:26.4640015Z inflating: build/custom_test_artifacts/custom-op-build/CMakeCache.txt 2025-12-04T09:51:26.4641019Z inflating: build/custom_test_artifacts/custom-op-build/Makefile 2025-12-04T09:51:26.4641902Z inflating: build/custom_test_artifacts/custom-op-build/cmake_install.cmake 2025-12-04T09:51:26.4792437Z inflating: build/custom_test_artifacts/custom-op-build/libcustom_ops.so 2025-12-04T09:51:26.4842579Z inflating: build/custom_test_artifacts/custom-op-build/test_custom_ops 2025-12-04T09:51:26.4843050Z creating: build/custom_test_artifacts/jit-hook-build/ 2025-12-04T09:51:26.4843469Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/ 2025-12-04T09:51:26.4843962Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:51:26.4851057Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:51:26.4851631Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/ 2025-12-04T09:51:26.4852192Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:51:26.4852806Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:51:26.4853414Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:51:26.4856161Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:51:26.4857398Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:51:26.4858434Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:51:26.4858942Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:51:26.4859420Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:51:26.4861972Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:51:26.4863512Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:51:26.4864585Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:51:26.4866313Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:51:26.4868226Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:51:26.4868783Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:51:26.4869270Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:51:26.4921006Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:51:26.4973737Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:51:26.4974657Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:51:26.5030884Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:51:26.5031806Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:51:26.5032709Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:51:26.5033772Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:51:26.5034700Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:51:26.5035576Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:51:26.5036486Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:51:26.5037570Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:51:26.5038819Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:51:26.5039660Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:51:26.5040474Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:51:26.5041404Z inflating: 
build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:51:26.5042439Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:51:26.5043513Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:51:26.5046066Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:51:26.5110570Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:51:26.5111529Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:51:26.5176925Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:51:26.5177646Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:51:26.5178180Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:51:26.5178730Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/cmake.check_cache 2025-12-04T09:51:26.5179517Z creating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/ 2025-12-04T09:51:26.5180187Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.ts 2025-12-04T09:51:26.5180948Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/compiler_depend.make 2025-12-04T09:51:26.5181672Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/depend.make 2025-12-04T09:51:26.5182353Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/link.txt 2025-12-04T09:51:26.5183043Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/cmake_clean.cmake 2025-12-04T09:51:26.5184001Z 
inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/build.make 2025-12-04T09:51:26.5184784Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/DependInfo.cmake 2025-12-04T09:51:26.5185541Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/flags.make 2025-12-04T09:51:26.5186795Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/progress.make 2025-12-04T09:51:26.5205503Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o.d 2025-12-04T09:51:26.5261340Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/test_jit_hooks.dir/test_jit_hooks.cpp.o 2025-12-04T09:51:26.5262229Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:51:26.5263045Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:51:26.5263698Z extracting: build/custom_test_artifacts/jit-hook-build/CMakeFiles/progress.marks 2025-12-04T09:51:26.5264801Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile2 2025-12-04T09:51:26.5266566Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeFiles/Makefile.cmake 2025-12-04T09:51:26.5267413Z inflating: build/custom_test_artifacts/jit-hook-build/detect_cuda_version.cc 2025-12-04T09:51:26.5270047Z inflating: build/custom_test_artifacts/jit-hook-build/CMakeCache.txt 2025-12-04T09:51:26.5270801Z inflating: build/custom_test_artifacts/jit-hook-build/Makefile 2025-12-04T09:51:26.5271651Z inflating: build/custom_test_artifacts/jit-hook-build/cmake_install.cmake 2025-12-04T09:51:26.5310020Z inflating: build/custom_test_artifacts/jit-hook-build/test_jit_hooks 2025-12-04T09:51:26.5310507Z creating: build/custom_test_artifacts/custom-backend-build/ 2025-12-04T09:51:26.5310968Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/ 
2025-12-04T09:51:26.5311517Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/pkgRedirects/ 2025-12-04T09:51:26.5318686Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeConfigureLog.yaml 2025-12-04T09:51:26.5319322Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/ 2025-12-04T09:51:26.5319953Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeSystem.cmake 2025-12-04T09:51:26.5320629Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/ 2025-12-04T09:51:26.5321297Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/tmp/ 2025-12-04T09:51:26.5323999Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/CMakeCCompilerId.c 2025-12-04T09:51:26.5325288Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdC/a.out 2025-12-04T09:51:26.5326281Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCCompiler.cmake 2025-12-04T09:51:26.5326959Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/ 2025-12-04T09:51:26.5327825Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/tmp/ 2025-12-04T09:51:26.5329781Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/CMakeCXXCompilerId.cpp 2025-12-04T09:51:26.5331241Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCXX/a.out 2025-12-04T09:51:26.5332376Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCXXCompiler.cmake 2025-12-04T09:51:26.5334134Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_C.bin 2025-12-04T09:51:26.5335983Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CXX.bin 2025-12-04T09:51:26.5336583Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/ 2025-12-04T09:51:26.5337116Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/ 2025-12-04T09:51:26.5389200Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp4.ii 2025-12-04T09:51:26.5441392Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.cpp 2025-12-04T09:51:26.5442383Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.module_id 2025-12-04T09:51:26.5498390Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cpp1.ii 2025-12-04T09:51:26.5499356Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.c 2025-12-04T09:51:26.5500530Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.gpu 2025-12-04T09:51:26.5501513Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.cudafe1.stub.c 2025-12-04T09:51:26.5502481Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.ptx 2025-12-04T09:51:26.5503409Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.sm_52.cubin 2025-12-04T09:51:26.5504343Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin 2025-12-04T09:51:26.5505270Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.fatbin.c 2025-12-04T09:51:26.5506551Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/CMakeCUDACompilerId.o 2025-12-04T09:51:26.5507494Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.sm_52.cubin 2025-12-04T09:51:26.5508339Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.reg.c 2025-12-04T09:51:26.5509167Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin 2025-12-04T09:51:26.5510217Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.fatbin.c 2025-12-04T09:51:26.5511273Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/tmp/a_dlink.o 2025-12-04T09:51:26.5513835Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/CMakeCUDACompilerId.cu 2025-12-04T09:51:26.5578438Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CompilerIdCUDA/a.out 2025-12-04T09:51:26.5579322Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeCUDACompiler.cmake 2025-12-04T09:51:26.5644004Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/3.31.6/CMakeDetermineCompilerABI_CUDA.bin 2025-12-04T09:51:26.5644761Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeScratch/ 2025-12-04T09:51:26.5645336Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeTmp/ 2025-12-04T09:51:26.5645928Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/cmake.check_cache 2025-12-04T09:51:26.5646545Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/ 
2025-12-04T09:51:26.5647256Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.ts 2025-12-04T09:51:26.5648078Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/compiler_depend.make 2025-12-04T09:51:26.5648791Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/depend.make 2025-12-04T09:51:26.5649371Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/link.txt 2025-12-04T09:51:26.5650143Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/cmake_clean.cmake 2025-12-04T09:51:26.5651138Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/build.make 2025-12-04T09:51:26.5651895Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/DependInfo.cmake 2025-12-04T09:51:26.5652724Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/flags.make 2025-12-04T09:51:26.5653704Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/progress.make 2025-12-04T09:51:26.5658369Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o.d 2025-12-04T09:51:26.5763366Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/custom_backend.dir/custom_backend.cpp.o 2025-12-04T09:51:26.5764123Z creating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/ 2025-12-04T09:51:26.5764873Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.ts 2025-12-04T09:51:26.5765712Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/compiler_depend.make 2025-12-04T09:51:26.5766520Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/depend.make 2025-12-04T09:51:26.5767281Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/link.txt 2025-12-04T09:51:26.5768055Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/cmake_clean.cmake 2025-12-04T09:51:26.5768934Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/build.make 2025-12-04T09:51:26.5769748Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/DependInfo.cmake 2025-12-04T09:51:26.5770551Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/flags.make 2025-12-04T09:51:26.5771550Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/progress.make 2025-12-04T09:51:26.5790530Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o.d 2025-12-04T09:51:26.5838875Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/test_custom_backend.dir/test_custom_backend.cpp.o 2025-12-04T09:51:26.5839837Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/CMakeDirectoryInformation.cmake 2025-12-04T09:51:26.5840599Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/TargetDirectories.txt 2025-12-04T09:51:26.5841478Z extracting: build/custom_test_artifacts/custom-backend-build/CMakeFiles/progress.marks 2025-12-04T09:51:26.5842387Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile2 2025-12-04T09:51:26.5844291Z inflating: build/custom_test_artifacts/custom-backend-build/CMakeFiles/Makefile.cmake 2025-12-04T09:51:26.5844897Z inflating: build/custom_test_artifacts/custom-backend-build/detect_cuda_version.cc 2025-12-04T09:51:26.5847558Z inflating: 
build/custom_test_artifacts/custom-backend-build/CMakeCache.txt 2025-12-04T09:51:26.5848434Z inflating: build/custom_test_artifacts/custom-backend-build/Makefile 2025-12-04T09:51:26.5849357Z inflating: build/custom_test_artifacts/custom-backend-build/cmake_install.cmake 2025-12-04T09:51:26.5938144Z inflating: build/custom_test_artifacts/custom-backend-build/libcustom_backend.so 2025-12-04T09:51:26.5974088Z inflating: build/custom_test_artifacts/custom-backend-build/test_custom_backend 2025-12-04T09:51:26.5974529Z creating: build/lib/ 2025-12-04T09:51:26.6071133Z inflating: build/lib/libprotobuf-lite.a 2025-12-04T09:51:26.6513813Z inflating: build/lib/libprotobuf.a 2025-12-04T09:51:26.6948767Z inflating: build/lib/libprotoc.a 2025-12-04T09:51:26.6958875Z inflating: build/lib/libpthreadpool.a 2025-12-04T09:51:26.6966274Z inflating: build/lib/libcpuinfo.a 2025-12-04T09:51:26.6973249Z inflating: build/lib/libcpuinfo_internals.a 2025-12-04T09:51:26.6974334Z inflating: build/lib/libclog.a 2025-12-04T09:51:26.6992456Z inflating: build/lib/libpytorch_qnnpack.a 2025-12-04T09:51:26.6994803Z inflating: build/lib/libnnpack_reference_layers.a 2025-12-04T09:51:26.7011003Z inflating: build/lib/libnnpack.a 2025-12-04T09:51:26.7260011Z inflating: build/lib/libmicrokernels-prod.a 2025-12-04T09:51:26.8373241Z inflating: build/lib/libmicrokernels-all.a 2025-12-04T09:51:26.8434020Z inflating: build/lib/libgtest.a 2025-12-04T09:51:26.8449176Z inflating: build/lib/libgmock.a 2025-12-04T09:51:26.8450095Z inflating: build/lib/libgtest_main.a 2025-12-04T09:51:26.8451105Z inflating: build/lib/libgmock_main.a 2025-12-04T09:51:26.8547459Z inflating: build/lib/libXNNPACK.a 2025-12-04T09:51:26.8618314Z inflating: build/lib/libbenchmark.a 2025-12-04T09:51:26.8619149Z inflating: build/lib/libbenchmark_main.a 2025-12-04T09:51:26.8626489Z inflating: build/lib/libittnotify.a 2025-12-04T09:51:26.8627372Z inflating: build/lib/libjitprofiling.a 2025-12-04T09:51:26.8692741Z inflating: 
build/lib/libasmjit.a 2025-12-04T09:51:26.9887240Z inflating: build/lib/libfbgemm.a 2025-12-04T09:51:26.9916644Z inflating: build/lib/libtensorpipe_uv.a 2025-12-04T09:51:27.0436289Z inflating: build/lib/libtensorpipe.a 2025-12-04T09:51:27.0666066Z inflating: build/lib/libtensorpipe_cuda.a 2025-12-04T09:51:27.0782836Z inflating: build/lib/libgloo.a 2025-12-04T09:51:27.0840167Z inflating: build/lib/libonnx_proto.a 2025-12-04T09:51:27.1227082Z inflating: build/lib/libgloo_cuda.a 2025-12-04T09:51:27.1877555Z inflating: build/lib/libonnx.a 2025-12-04T09:51:28.0868274Z inflating: build/lib/libdnnl.a 2025-12-04T09:51:28.0885883Z inflating: build/lib/libfmt.a 2025-12-04T09:51:28.1315960Z inflating: build/lib/libkineto.a 2025-12-04T09:51:28.1420518Z inflating: build/lib/libc10.so 2025-12-04T09:51:28.1464604Z inflating: build/lib/libc10_cuda.so 2025-12-04T09:51:28.1466507Z inflating: build/lib/libcaffe2_nvrtc.so 2025-12-04T09:51:28.1468141Z inflating: build/lib/libtorch_global_deps.so 2025-12-04T09:51:30.8657271Z inflating: build/lib/libtorch_cpu.so 2025-12-04T09:51:30.9387780Z inflating: build/lib/libtorch_nvshmem.so 2025-12-04T09:51:33.5061740Z inflating: build/lib/libtorch_cuda.so 2025-12-04T09:51:33.5065219Z inflating: build/lib/libtorch.so 2025-12-04T09:51:33.5110394Z inflating: build/lib/libtorch_cuda_linalg.so 2025-12-04T09:51:33.5176206Z inflating: build/lib/libtorchbind_test.so 2025-12-04T09:51:33.5194859Z inflating: build/lib/libjitbackend_test.so 2025-12-04T09:51:33.5218204Z inflating: build/lib/libbackend_with_compiler.so 2025-12-04T09:51:33.5243632Z inflating: build/lib/libaoti_custom_ops.so 2025-12-04T09:51:33.5247784Z inflating: build/lib/libc10d_cuda_test.so 2025-12-04T09:51:33.5251818Z inflating: build/lib/libshm.so 2025-12-04T09:51:33.7397918Z inflating: build/lib/libtorch_python.so 2025-12-04T09:51:33.7432517Z inflating: build/lib/libnnapi_backend.so 2025-12-04T09:51:33.7433037Z creating: build/bin/ 2025-12-04T09:51:33.7850364Z inflating: 
build/bin/protoc-3.13.0.0 2025-12-04T09:51:33.8268692Z inflating: build/bin/protoc 2025-12-04T09:51:33.8320134Z inflating: build/bin/c10_AllocatorConfig_test 2025-12-04T09:51:33.8368758Z inflating: build/bin/c10_CompileTimeFunctionPointer_test 2025-12-04T09:51:33.8419139Z inflating: build/bin/c10_DeviceGuard_test 2025-12-04T09:51:33.8469411Z inflating: build/bin/c10_Device_test 2025-12-04T09:51:33.8527414Z inflating: build/bin/c10_DispatchKeySet_test 2025-12-04T09:51:33.8580889Z inflating: build/bin/c10_Scalar_test 2025-12-04T09:51:33.8628225Z inflating: build/bin/c10_StreamGuard_test 2025-12-04T09:51:33.8683940Z inflating: build/bin/c10_SymInt_test 2025-12-04T09:51:33.8738252Z inflating: build/bin/c10_InlineStreamGuard_test 2025-12-04T09:51:33.8806118Z inflating: build/bin/c10_cow_test 2025-12-04T09:51:33.8860062Z inflating: build/bin/c10_SizesAndStrides_test 2025-12-04T09:51:33.8913760Z inflating: build/bin/c10_InlineDeviceGuard_test 2025-12-04T09:51:33.8964876Z inflating: build/bin/c10_Bitset_test 2025-12-04T09:51:33.9012984Z inflating: build/bin/c10_ArrayRef_test 2025-12-04T09:51:33.9060811Z inflating: build/bin/c10_ConstexprCrc_test 2025-12-04T09:51:33.9108856Z inflating: build/bin/c10_DeadlockDetection_test 2025-12-04T09:51:33.9162950Z inflating: build/bin/c10_LeftRight_test 2025-12-04T09:51:33.9212268Z inflating: build/bin/c10_Half_test 2025-12-04T09:51:33.9263433Z inflating: build/bin/c10_IntrusiveList_test 2025-12-04T09:51:33.9317112Z inflating: build/bin/c10_NetworkFlow_test 2025-12-04T09:51:33.9372004Z inflating: build/bin/c10_Enumerate_test 2025-12-04T09:51:33.9420357Z inflating: build/bin/c10_Synchronized_test 2025-12-04T09:51:33.9468679Z inflating: build/bin/c10_Semaphore_test 2025-12-04T09:51:33.9522033Z inflating: build/bin/c10_ThreadLocal_test 2025-12-04T09:51:33.9572079Z inflating: build/bin/c10_accumulate_test 2025-12-04T09:51:33.9622050Z inflating: build/bin/c10_TypeIndex_test 2025-12-04T09:51:33.9675758Z inflating: build/bin/c10_bfloat16_test 
2025-12-04T09:51:33.9724761Z inflating: build/bin/c10_bit_cast_test 2025-12-04T09:51:33.9773302Z inflating: build/bin/c10_error_test 2025-12-04T09:51:33.9827775Z inflating: build/bin/c10_complex_math_test 2025-12-04T09:51:33.9878306Z inflating: build/bin/c10_exception_test 2025-12-04T09:51:33.9931671Z inflating: build/bin/c10_complex_test 2025-12-04T09:51:33.9980118Z inflating: build/bin/c10_flags_test 2025-12-04T09:51:34.0028963Z inflating: build/bin/c10_generic_math_test 2025-12-04T09:51:34.0078430Z inflating: build/bin/c10_irange_test 2025-12-04T09:51:34.0232289Z inflating: build/bin/c10_intrusive_ptr_test 2025-12-04T09:51:34.0284521Z inflating: build/bin/c10_lazy_test 2025-12-04T09:51:34.0339418Z inflating: build/bin/c10_logging_test 2025-12-04T09:51:34.0388644Z inflating: build/bin/c10_nofatal_test 2025-12-04T09:51:34.0460212Z inflating: build/bin/c10_optional_test 2025-12-04T09:51:34.0519321Z inflating: build/bin/c10_ordered_preserving_dict_test 2025-12-04T09:51:34.0570599Z inflating: build/bin/c10_registry_test 2025-12-04T09:51:34.0624443Z inflating: build/bin/c10_string_util_test 2025-12-04T09:51:34.0769029Z inflating: build/bin/c10_small_vector_test 2025-12-04T09:51:34.0819021Z inflating: build/bin/c10_ssize_test 2025-12-04T09:51:34.0867664Z inflating: build/bin/c10_tempfile_test 2025-12-04T09:51:34.0912327Z inflating: build/bin/c10_intrusive_ptr_benchmark 2025-12-04T09:51:34.0959821Z inflating: build/bin/c10_string_view_test 2025-12-04T09:51:34.1014313Z inflating: build/bin/c10_typeid_test 2025-12-04T09:51:34.1071994Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_from_2_processes 2025-12-04T09:51:34.1128965Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_1_var_test 2025-12-04T09:51:34.1185992Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2025-12-04T09:51:34.1233795Z inflating: build/bin/c10_cuda_CUDATest 2025-12-04T09:51:34.1290653Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_catches_stream 
2025-12-04T09:51:34.1347640Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T09:51:34.1407112Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T09:51:34.1463774Z inflating: build/bin/c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2025-12-04T09:51:34.2039354Z inflating: build/bin/vec_test_all_types_DEFAULT 2025-12-04T09:51:34.2625778Z inflating: build/bin/vec_test_all_types_AVX512 2025-12-04T09:51:34.3211348Z inflating: build/bin/vec_test_all_types_AVX2 2025-12-04T09:51:34.3259319Z inflating: build/bin/test_vec_half_DEFAULT 2025-12-04T09:51:34.3350801Z inflating: build/bin/test_aoti_abi_check 2025-12-04T09:51:34.3399294Z inflating: build/bin/test_vec_half_AVX512 2025-12-04T09:51:34.3447893Z inflating: build/bin/test_vec_half_AVX2 2025-12-04T09:51:34.3498664Z inflating: build/bin/BackoffTest 2025-12-04T09:51:34.3551087Z inflating: build/bin/FileStoreTest 2025-12-04T09:51:34.3606163Z inflating: build/bin/TCPStoreTest 2025-12-04T09:51:34.3658713Z inflating: build/bin/HashStoreTest 2025-12-04T09:51:34.3729396Z inflating: build/bin/Dict_test 2025-12-04T09:51:34.3792644Z inflating: build/bin/Dimname_test 2025-12-04T09:51:34.3855451Z inflating: build/bin/MaybeOwned_test 2025-12-04T09:51:34.3910696Z inflating: build/bin/NamedTensor_test 2025-12-04T09:51:34.3967889Z inflating: build/bin/apply_utils_test 2025-12-04T09:51:34.4024192Z inflating: build/bin/atest 2025-12-04T09:51:34.4086372Z inflating: build/bin/basic 2025-12-04T09:51:34.4138757Z inflating: build/bin/broadcast_test 2025-12-04T09:51:34.4189415Z inflating: build/bin/cpu_allocator_test 2025-12-04T09:51:34.4244933Z inflating: build/bin/cpu_generator_test 2025-12-04T09:51:34.4297299Z inflating: build/bin/cpu_profiling_allocator_test 2025-12-04T09:51:34.4384771Z inflating: build/bin/cpu_rng_test 2025-12-04T09:51:34.4434007Z inflating: build/bin/dlconvertor_test 2025-12-04T09:51:34.4490517Z inflating: 
build/bin/extension_backend_test 2025-12-04T09:51:34.4543651Z inflating: build/bin/half_test 2025-12-04T09:51:34.4637660Z inflating: build/bin/ivalue_test 2025-12-04T09:51:34.4686553Z inflating: build/bin/lazy_tensor_test 2025-12-04T09:51:34.4739460Z inflating: build/bin/math_kernel_test 2025-12-04T09:51:34.4793173Z inflating: build/bin/memory_format_test 2025-12-04T09:51:34.4845557Z inflating: build/bin/memory_overlapping_test 2025-12-04T09:51:34.4897224Z inflating: build/bin/mobile_memory_cleanup 2025-12-04T09:51:34.4951617Z inflating: build/bin/native_test 2025-12-04T09:51:34.5000750Z inflating: build/bin/operator_name_test 2025-12-04T09:51:34.5049608Z inflating: build/bin/operators_test 2025-12-04T09:51:34.5100165Z inflating: build/bin/packedtensoraccessor_test 2025-12-04T09:51:34.5165518Z inflating: build/bin/pow_test 2025-12-04T09:51:34.5221095Z inflating: build/bin/quantized_test 2025-12-04T09:51:34.5269550Z inflating: build/bin/reduce_ops_test 2025-12-04T09:51:34.5318628Z inflating: build/bin/reportMemoryUsage_test 2025-12-04T09:51:34.5373606Z inflating: build/bin/scalar_tensor_test 2025-12-04T09:51:34.5429808Z inflating: build/bin/scalar_test 2025-12-04T09:51:34.5480714Z inflating: build/bin/StorageUtils_test 2025-12-04T09:51:34.5531896Z inflating: build/bin/stride_properties_test 2025-12-04T09:51:34.5605569Z inflating: build/bin/tensor_iterator_test 2025-12-04T09:51:34.5658384Z inflating: build/bin/type_ptr_test 2025-12-04T09:51:34.5707326Z inflating: build/bin/thread_init_test 2025-12-04T09:51:34.5761086Z inflating: build/bin/test_parallel 2025-12-04T09:51:34.5817354Z inflating: build/bin/type_test 2025-12-04T09:51:34.5868612Z inflating: build/bin/undefined_tensor_test 2025-12-04T09:51:34.5916187Z inflating: build/bin/verify_api_visibility 2025-12-04T09:51:34.5984233Z inflating: build/bin/legacy_vmap_test 2025-12-04T09:51:34.6033898Z inflating: build/bin/weakref_test 2025-12-04T09:51:34.6083524Z inflating: build/bin/wrapdim_test 
2025-12-04T09:51:34.6134295Z inflating: build/bin/xla_tensor_test 2025-12-04T09:51:34.6191712Z inflating: build/bin/IListRef_test 2025-12-04T09:51:34.6291155Z inflating: build/bin/List_test 2025-12-04T09:51:34.6355621Z inflating: build/bin/KernelFunction_test 2025-12-04T09:51:34.6469060Z inflating: build/bin/kernel_function_legacy_test 2025-12-04T09:51:34.6561181Z inflating: build/bin/kernel_function_test 2025-12-04T09:51:34.6680777Z inflating: build/bin/kernel_lambda_legacy_test 2025-12-04T09:51:34.6778800Z inflating: build/bin/kernel_lambda_test 2025-12-04T09:51:34.6837518Z inflating: build/bin/kernel_stackbased_test 2025-12-04T09:51:34.6929122Z inflating: build/bin/make_boxed_from_unboxed_functor_test 2025-12-04T09:51:34.6978255Z inflating: build/bin/CppSignature_test 2025-12-04T09:51:34.7032169Z inflating: build/bin/backend_fallback_test 2025-12-04T09:51:34.7079325Z inflating: build/bin/op_allowlist_test 2025-12-04T09:51:34.7368437Z inflating: build/bin/op_registration_test 2025-12-04T09:51:34.7431665Z inflating: build/bin/inline_container_test 2025-12-04T09:51:34.7484219Z inflating: build/bin/cuda_allocator_test 2025-12-04T09:51:34.7534924Z inflating: build/bin/cuda_apply_test 2025-12-04T09:51:34.7604446Z inflating: build/bin/cuda_atomic_ops_test 2025-12-04T09:51:34.7659754Z inflating: build/bin/cuda_caching_host_allocator_test 2025-12-04T09:51:34.7754765Z inflating: build/bin/cuda_complex_math_test 2025-12-04T09:51:34.7823606Z inflating: build/bin/cuda_complex_test 2025-12-04T09:51:34.7890626Z inflating: build/bin/cuda_cub_test 2025-12-04T09:51:34.7941757Z inflating: build/bin/cuda_cublas_handle_pool_test 2025-12-04T09:51:34.7989983Z inflating: build/bin/cuda_device_test 2025-12-04T09:51:34.8074185Z inflating: build/bin/cuda_distributions_test 2025-12-04T09:51:34.8124429Z inflating: build/bin/cuda_dlconvertor_test 2025-12-04T09:51:34.8176900Z inflating: build/bin/cuda_event_test 2025-12-04T09:51:34.8224723Z inflating: build/bin/cuda_exchange_device_test 
2025-12-04T09:51:34.8300000Z inflating: build/bin/cuda_generator_test 2025-12-04T09:51:34.8364118Z inflating: build/bin/cuda_half_test 2025-12-04T09:51:34.8421857Z inflating: build/bin/cuda_integer_divider_test 2025-12-04T09:51:34.8485388Z inflating: build/bin/cuda_optional_test 2025-12-04T09:51:34.8554982Z inflating: build/bin/cuda_packedtensoraccessor_test 2025-12-04T09:51:34.8605712Z inflating: build/bin/cuda_reportMemoryUsage_test 2025-12-04T09:51:34.8653327Z inflating: build/bin/cuda_allocatorTraceTracker_test 2025-12-04T09:51:34.8712341Z inflating: build/bin/cuda_stream_test 2025-12-04T09:51:34.8776436Z inflating: build/bin/cuda_vectorized_test 2025-12-04T09:51:34.8824717Z inflating: build/bin/cuda_cudnn_test 2025-12-04T09:51:34.9139789Z inflating: build/bin/test_lazy 2025-12-04T09:51:35.0156908Z inflating: build/bin/test_jit 2025-12-04T09:51:35.0221867Z inflating: build/bin/ProcessGroupGlooTest 2025-12-04T09:51:35.0284143Z inflating: build/bin/ProcessGroupNCCLTest 2025-12-04T09:51:35.0338563Z inflating: build/bin/ProcessGroupGlooAsyncTest 2025-12-04T09:51:35.0398327Z inflating: build/bin/ProcessGroupNCCLErrorsTest 2025-12-04T09:51:35.0412198Z inflating: build/bin/ProcessGroupMPITest 2025-12-04T09:51:35.0416288Z inflating: build/bin/example_allreduce 2025-12-04T09:51:35.0470594Z inflating: build/bin/test_dist_autograd 2025-12-04T09:51:35.0537049Z inflating: build/bin/test_cpp_rpc 2025-12-04T09:51:35.1572666Z inflating: build/bin/test_api 2025-12-04T09:51:35.1575042Z inflating: build/bin/parallel_benchmark 2025-12-04T09:51:35.1578562Z inflating: build/bin/torch_shm_manager 2025-12-04T09:51:35.1579058Z creating: .additional_ci_files/ 2025-12-04T09:51:35.1634902Z inflating: .additional_ci_files/test-times.json 2025-12-04T09:51:35.1838778Z inflating: .additional_ci_files/test-class-times.json 2025-12-04T09:51:35.1877784Z ##[group]Run rm artifacts.zip 2025-12-04T09:51:35.1878010Z rm artifacts.zip 2025-12-04T09:51:35.1886225Z shell: /usr/bin/bash --noprofile --norc 
-e -o pipefail {0} 2025-12-04T09:51:35.1886497Z env: 2025-12-04T09:51:35.1886648Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:35.1887034Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:35.1887280Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:35.1887538Z ##[endgroup] 2025-12-04T09:51:35.2899738Z ##[group]Run df -H 2025-12-04T09:51:35.2899983Z df -H 2025-12-04T09:51:35.2907525Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:35.2907802Z env: 2025-12-04T09:51:35.2907961Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:35.2908151Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:35.2908381Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:35.2908654Z ##[endgroup] 2025-12-04T09:51:35.2959880Z Filesystem Size Used Avail Use% Mounted on 2025-12-04T09:51:35.2960376Z devtmpfs 4.2M 0 4.2M 0% /dev 2025-12-04T09:51:35.2960815Z tmpfs 33G 0 33G 0% /dev/shm 2025-12-04T09:51:35.2961273Z tmpfs 13G 779k 13G 1% /run 2025-12-04T09:51:35.2961671Z /dev/nvme0n1p1 161G 55G 107G 34% / 2025-12-04T09:51:35.2962081Z tmpfs 33G 17k 33G 1% /tmp 2025-12-04T09:51:35.2962512Z /dev/nvme0n1p128 11M 1.4M 9.2M 13% /boot/efi 2025-12-04T09:51:35.2962984Z tmpfs 6.5G 0 6.5G 0% /run/user/0 2025-12-04T09:51:35.2990213Z Prepare all required actions 2025-12-04T09:51:35.2991341Z Getting action download info 2025-12-04T09:51:35.4862621Z ##[group]Run ./.github/actions/download-td-artifacts 2025-12-04T09:51:35.4863039Z with: 2025-12-04T09:51:35.4863274Z env: 2025-12-04T09:51:35.4863525Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:35.4863848Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:35.4864245Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:35.4864684Z ##[endgroup] 2025-12-04T09:51:35.4900119Z ##[group]Run seemethere/download-artifact-s3@v4 2025-12-04T09:51:35.4900559Z with: 2025-12-04T09:51:35.4900810Z name: td_results 2025-12-04T09:51:35.4901111Z s3-bucket: gha-artifacts 2025-12-04T09:51:35.4901446Z region: us-east-1 
2025-12-04T09:51:35.4901711Z env: 2025-12-04T09:51:35.4901970Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:35.4902294Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:35.4902676Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:35.4903164Z ##[endgroup] 2025-12-04T09:51:35.9259798Z (node:60864) NOTE: We are formalizing our plans to enter AWS SDK for JavaScript (v2) into maintenance mode in 2023. 2025-12-04T09:51:35.9260261Z 2025-12-04T09:51:35.9260444Z Please migrate your code to use AWS SDK for JavaScript (v3). 2025-12-04T09:51:35.9260937Z For more information, check the migration guide at https://a.co/7PzMCcy 2025-12-04T09:51:35.9261437Z (Use `node --trace-warnings ...` to show where the warning was created) 2025-12-04T09:51:36.0319082Z Found 1 objects with prefix pytorch/pytorch/19922826259/td_results/ 2025-12-04T09:51:36.0319685Z Starting download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:51:36.0941727Z Finished download (1/1): /home/ec2-user/actions-runner/_work/pytorch/pytorch/td_results.json 2025-12-04T09:51:36.0946674Z Artifact download has finished successfully 2025-12-04T09:51:36.1194904Z ##[group]Run mkdir -p .additional_ci_files 2025-12-04T09:51:36.1195215Z mkdir -p .additional_ci_files 2025-12-04T09:51:36.1195535Z mv td_results.json .additional_ci_files/td_results.json || true 2025-12-04T09:51:36.1203540Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:36.1203811Z env: 2025-12-04T09:51:36.1203975Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:36.1204168Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:36.1204384Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:36.1204631Z ##[endgroup] 2025-12-04T09:51:36.1304194Z ##[group]Run .github/scripts/parse_ref.py 2025-12-04T09:51:36.1304514Z .github/scripts/parse_ref.py 2025-12-04T09:51:36.1312121Z shell: /usr/bin/bash -e {0} 2025-12-04T09:51:36.1312324Z env: 2025-12-04T09:51:36.1312483Z GIT_DEFAULT_BRANCH: main 
2025-12-04T09:51:36.1312674Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:36.1312908Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:36.1313170Z ##[endgroup] 2025-12-04T09:51:36.1530204Z Setting output branch=main 2025-12-04T09:51:36.1628294Z Prepare all required actions 2025-12-04T09:51:36.1628622Z Getting action download info 2025-12-04T09:51:36.3202110Z ##[group]Run ./.github/actions/filter-test-configs 2025-12-04T09:51:36.3202347Z with: 2025-12-04T09:51:36.3202672Z github-token: *** 2025-12-04T09:51:36.3208663Z test-matrix: {"include": [{"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": 
"rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:51:36.3215347Z job-name: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:51:36.3215875Z env: 2025-12-04T09:51:36.3216035Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:36.3216222Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:36.3216436Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:36.3216692Z ##[endgroup] 2025-12-04T09:51:36.3245208Z ##[group]Run nick-fields/retry@v3.0.0 2025-12-04T09:51:36.3245420Z with: 2025-12-04T09:51:36.3245572Z shell: bash 2025-12-04T09:51:36.3245746Z timeout_minutes: 10 2025-12-04T09:51:36.3245914Z max_attempts: 5 2025-12-04T09:51:36.3246083Z retry_wait_seconds: 30 2025-12-04T09:51:36.3246650Z command: set -eux # PyYAML 6.0 doesn't work with MacOS x86 anymore # This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2 python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:51:36.3247246Z polling_interval_seconds: 1 2025-12-04T09:51:36.3247440Z warning_on_retry: true 2025-12-04T09:51:36.3247626Z continue_on_error: false 2025-12-04T09:51:36.3247805Z env: 2025-12-04T09:51:36.3248120Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:36.3248327Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:36.3248550Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:36.3248940Z GITHUB_TOKEN: *** 2025-12-04T09:51:36.3249119Z ##[endgroup] 2025-12-04T09:51:36.4230543Z + python3 -m pip install requests==2.27.1 pyyaml==6.0.2 2025-12-04T09:51:36.6423114Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T09:51:36.7527260Z Collecting requests==2.27.1 2025-12-04T09:51:36.7681562Z Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB) 2025-12-04T09:51:36.9293594Z Collecting pyyaml==6.0.2 2025-12-04T09:51:36.9353971Z Downloading PyYAML-6.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (737 kB) 2025-12-04T09:51:37.3059895Z Collecting charset-normalizer~=2.0.0 2025-12-04T09:51:37.3095809Z Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB) 2025-12-04T09:51:37.3149459Z Requirement already satisfied: idna<4,>=2.5 in 
/usr/lib/python3.9/site-packages (from requests==2.27.1) (2.10) 2025-12-04T09:51:37.3152637Z Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/lib/python3.9/site-packages (from requests==2.27.1) (1.25.10) 2025-12-04T09:51:37.3593994Z Collecting certifi>=2017.4.17 2025-12-04T09:51:37.3633139Z Downloading certifi-2025.11.12-py3-none-any.whl (159 kB) 2025-12-04T09:51:37.4420466Z Installing collected packages: charset-normalizer, certifi, requests, pyyaml 2025-12-04T09:51:37.5556853Z Successfully installed certifi-2025.11.12 charset-normalizer-2.0.12 pyyaml-6.0.2 requests-2.27.1 2025-12-04T09:51:38.4005456Z Command completed after 1 attempt(s). 2025-12-04T09:51:38.4066932Z ##[group]Run set -x 2025-12-04T09:51:38.4067139Z set -x 2025-12-04T09:51:38.4067406Z  2025-12-04T09:51:38.4067686Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:51:38.4068042Z # in runner workspace 2025-12-04T09:51:38.4068337Z python3 "${GITHUB_ACTION_PATH}/../../scripts/parse_ref.py" 2025-12-04T09:51:38.4076269Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:38.4087384Z env: 2025-12-04T09:51:38.4087585Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.4087789Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.4088036Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.4088304Z ##[endgroup] 2025-12-04T09:51:38.4118502Z + python3 /home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/filter-test-configs/../../scripts/parse_ref.py 2025-12-04T09:51:38.4296717Z Setting output branch=main 2025-12-04T09:51:38.4342628Z ##[group]Run echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:51:38.4342933Z echo "Workflow: ${GITHUB_WORKFLOW}" 2025-12-04T09:51:38.4343177Z echo "Job name: ${JOB_NAME}" 2025-12-04T09:51:38.4343385Z  2025-12-04T09:51:38.4343655Z # Use relative path here as this could be checked out anywhere, not necessarily 2025-12-04T09:51:38.4344068Z # in runner workspace 
2025-12-04T09:51:38.4344378Z python3 "${GITHUB_ACTION_PATH}/../../scripts/filter_test_configs.py" \ 2025-12-04T09:51:38.4344726Z  --workflow "${GITHUB_WORKFLOW}" \ 2025-12-04T09:51:38.4344957Z  --job-name "${JOB_NAME}" \ 2025-12-04T09:51:38.4351439Z  --test-matrix "{"include": [{"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": 
"linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}]}" \ 2025-12-04T09:51:38.4358225Z  --selected-test-configs "" \ 2025-12-04T09:51:38.4358473Z  --pr-number "${PR_NUMBER}" \ 2025-12-04T09:51:38.4358697Z  --tag "${TAG}" \ 2025-12-04T09:51:38.4358897Z  --event-name "${EVENT_NAME}" \ 2025-12-04T09:51:38.4359124Z  --schedule "${SCHEDULE}" \ 2025-12-04T09:51:38.4359336Z  --branch "${HEAD_BRANCH}" 2025-12-04T09:51:38.4366796Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:38.4367080Z env: 2025-12-04T09:51:38.4367236Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.4367421Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.4367641Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.4368075Z GITHUB_TOKEN: *** 2025-12-04T09:51:38.4368757Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:51:38.4369274Z PR_NUMBER: 2025-12-04T09:51:38.4369428Z TAG: 2025-12-04T09:51:38.4369573Z EVENT_NAME: schedule 2025-12-04T09:51:38.4369739Z SCHEDULE: 29 8 * * * 2025-12-04T09:51:38.4369909Z HEAD_BRANCH: main 2025-12-04T09:51:38.4370074Z ##[endgroup] 2025-12-04T09:51:38.4396725Z Workflow: periodic 2025-12-04T09:51:38.4397385Z Job name: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 2025-12-04T09:51:38.6272293Z Setting output keep-going=True 2025-12-04T09:51:38.6272610Z Setting output ci-verbose-test-logs=False 2025-12-04T09:51:38.6272924Z Setting output ci-test-showlocals=False 2025-12-04T09:51:38.6273218Z Setting output ci-no-test-timeout=False 2025-12-04T09:51:38.6273491Z Setting output ci-no-td=False 2025-12-04T09:51:38.6273759Z Setting output ci-td-distributed=False 2025-12-04T09:51:38.6274064Z Setting output is-unstable=False 2025-12-04T09:51:38.6274331Z Setting output reenabled-issues= 2025-12-04T09:51:38.6289478Z Setting output test-matrix={"include": [{"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", 
"owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": 
"linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", 
"shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}]} 2025-12-04T09:51:38.6303010Z Setting output is-test-matrix-empty=False 2025-12-04T09:51:38.6379007Z ##[group]Run echo "Filtered matrix:" 2025-12-04T09:51:38.6379299Z echo "Filtered matrix:" 2025-12-04T09:51:38.6393114Z echo "{"include": [{"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 1, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": 
"mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 2, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 3, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": 
["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 4, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 5, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 6, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": 
"linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "mem_leak_check": "mem_leak_check", "rerun_disabled_tests": "rerun_disabled_tests"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests", "mem_leak_check": "mem_leak_check"}, {"config": "default", "shard": 7, "num_shards": 7, "runner": "linux.g6.4xlarge.experimental.nvidia.gpu", "owners": ["oncall:debug-build"], "rerun_disabled_tests": "rerun_disabled_tests"}]}" 2025-12-04T09:51:38.6406696Z  2025-12-04T09:51:38.6406852Z echo 2025-12-04T09:51:38.6407053Z echo "Is the current job unstable? False" 2025-12-04T09:51:38.6407287Z  2025-12-04T09:51:38.6407432Z echo 2025-12-04T09:51:38.6407609Z echo "Is keep-going label set? True" 2025-12-04T09:51:38.6407829Z  2025-12-04T09:51:38.6407970Z echo 2025-12-04T09:51:38.6408145Z echo "Reenabled issues? 
" 2025-12-04T09:51:38.6415834Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:38.6416100Z env: 2025-12-04T09:51:38.6416268Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.6416468Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.6416692Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.6417055Z ##[endgroup] 2025-12-04T09:51:38.6444579Z Filtered matrix: 2025-12-04T09:51:38.6462049Z {include: [{config: default, shard: 1, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 1, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 1, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 2, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 2, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: 
[oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 3, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 3, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 4, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 4, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 5, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 5, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, 
owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 6, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 6, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 6, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 6, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 7, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check}, {config: default, shard: 7, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], mem_leak_check: mem_leak_check, rerun_disabled_tests: rerun_disabled_tests}, {config: default, shard: 7, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests, mem_leak_check: mem_leak_check}, {config: default, shard: 7, num_shards: 7, runner: linux.g6.4xlarge.experimental.nvidia.gpu, owners: [oncall:debug-build], rerun_disabled_tests: rerun_disabled_tests}]} 2025-12-04T09:51:38.6475660Z 2025-12-04T09:51:38.6475764Z Is the current job unstable? False 2025-12-04T09:51:38.6475910Z 2025-12-04T09:51:38.6475994Z Is keep-going label set? True 2025-12-04T09:51:38.6476127Z 2025-12-04T09:51:38.6476193Z Reenabled issues? 
2025-12-04T09:51:38.6504211Z ##[group]Run echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:51:38.6504643Z echo "timeout=$((JOB_TIMEOUT-30))" >> "${GITHUB_OUTPUT}" 2025-12-04T09:51:38.6511869Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:38.6512144Z env: 2025-12-04T09:51:38.6512323Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.6512514Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.6512732Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.6512985Z JOB_TIMEOUT: 240 2025-12-04T09:51:38.6513153Z ##[endgroup] 2025-12-04T09:51:38.6560638Z ##[group]Run env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:51:38.6561065Z env | grep '^GITHUB' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:51:38.6561395Z env | grep '^CI' >> "/tmp/github_env_${GITHUB_RUN_ID}" 2025-12-04T09:51:38.6568455Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T09:51:38.6568741Z env: 2025-12-04T09:51:38.6568904Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.6569094Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.6569314Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.6569567Z ##[endgroup] 2025-12-04T09:51:38.6668576Z ##[group]Run set -x 2025-12-04T09:51:38.6668858Z set -x 2025-12-04T09:51:38.6669199Z  2025-12-04T09:51:38.6669389Z if [[ $TEST_CONFIG == 'multigpu' ]]; then 2025-12-04T09:51:38.6669673Z  TEST_COMMAND=.ci/pytorch/multigpu-test.sh 2025-12-04T09:51:38.6669952Z elif [[ $BUILD_ENVIRONMENT == *onnx* ]]; then 2025-12-04T09:51:38.6670207Z  TEST_COMMAND=.ci/onnx/test.sh 2025-12-04T09:51:38.6670415Z else 2025-12-04T09:51:38.6670603Z  TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:51:38.6670816Z fi 2025-12-04T09:51:38.6670964Z  2025-12-04T09:51:38.6671150Z # Leaving 1GB for the runner and other things 2025-12-04T09:51:38.6671557Z TOTAL_AVAILABLE_MEMORY_IN_GB=$(awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo) 2025-12-04T09:51:38.6672172Z # 
https://docs.docker.com/engine/containers/resource_constraints/#--memory-swap-details, the 3GB swap 2025-12-04T09:51:38.6672666Z # comes from https://github.com/pytorch/test-infra/pull/6058 2025-12-04T09:51:38.6673064Z TOTAL_MEMORY_WITH_SWAP=$(("${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}" + 3)) 2025-12-04T09:51:38.6673355Z  2025-12-04T09:51:38.6673538Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:51:38.6673781Z  SHM_OPTS= 2025-12-04T09:51:38.6673958Z  JENKINS_USER= 2025-12-04T09:51:38.6674199Z  # ensure that docker container cleanly exits in 12 hours 2025-12-04T09:51:38.6674543Z  # if for some reason cleanup action doesn't stop container 2025-12-04T09:51:38.6674818Z  # when job is cancelled 2025-12-04T09:51:38.6675032Z  DOCKER_SHELL_CMD="sleep 12h" 2025-12-04T09:51:38.6675256Z  USED_IMAGE="${DOCKER_IMAGE_S390X}" 2025-12-04T09:51:38.6675467Z else 2025-12-04T09:51:38.6675643Z  SHM_OPTS="--shm-size=${SHM_SIZE}" 2025-12-04T09:51:38.6675872Z  JENKINS_USER="--user jenkins" 2025-12-04T09:51:38.6676089Z  DOCKER_SHELL_CMD= 2025-12-04T09:51:38.6676293Z  USED_IMAGE="${DOCKER_IMAGE}" 2025-12-04T09:51:38.6676486Z fi 2025-12-04T09:51:38.6676635Z  2025-12-04T09:51:38.6676867Z # detached container should get cleaned up by teardown_ec2_linux 2025-12-04T09:51:38.6677233Z # TODO: Stop building test binaries as part of the build phase 2025-12-04T09:51:38.6677660Z # Used for GPU_FLAG, SHM_OPTS, JENKINS_USER and DOCKER_SHELL_CMD since that doesn't play nice 2025-12-04T09:51:38.6678039Z # shellcheck disable=SC2086,SC2090 2025-12-04T09:51:38.6678273Z container_name=$(docker run \ 2025-12-04T09:51:38.6678488Z  ${GPU_FLAG:-} \ 2025-12-04T09:51:38.6678705Z  ${SCCACHE_SERVER_PORT_DOCKER_FLAG:-} \ 2025-12-04T09:51:38.6678951Z  -e BUILD_ENVIRONMENT \ 2025-12-04T09:51:38.6679160Z  -e PR_NUMBER \ 2025-12-04T09:51:38.6679348Z  -e GITHUB_ACTIONS \ 2025-12-04T09:51:38.6679547Z  -e GITHUB_REPOSITORY \ 2025-12-04T09:51:38.6679769Z  -e GITHUB_WORKFLOW \ 2025-12-04T09:51:38.6679962Z  -e 
GITHUB_JOB \ 2025-12-04T09:51:38.6680147Z  -e GITHUB_RUN_ID \ 2025-12-04T09:51:38.6680344Z  -e GITHUB_RUN_NUMBER \ 2025-12-04T09:51:38.6680542Z  -e GITHUB_RUN_ATTEMPT \ 2025-12-04T09:51:38.6680758Z  -e JOB_ID \ 2025-12-04T09:51:38.6680942Z  -e JOB_NAME \ 2025-12-04T09:51:38.6681121Z  -e BASE_SHA \ 2025-12-04T09:51:38.6681293Z  -e BRANCH \ 2025-12-04T09:51:38.6681459Z  -e SHA1 \ 2025-12-04T09:51:38.6681634Z  -e AWS_DEFAULT_REGION \ 2025-12-04T09:51:38.6681832Z  -e IN_WHEEL_TEST \ 2025-12-04T09:51:38.6682035Z  -e SHARD_NUMBER \ 2025-12-04T09:51:38.6682229Z  -e TEST_CONFIG \ 2025-12-04T09:51:38.6682413Z  -e NUM_TEST_SHARDS \ 2025-12-04T09:51:38.6682723Z  -e REENABLED_ISSUES \ 2025-12-04T09:51:38.6682954Z  -e CONTINUE_THROUGH_ERROR \ 2025-12-04T09:51:38.6683245Z  -e VERBOSE_TEST_LOGS \ 2025-12-04T09:51:38.6683446Z  -e TEST_SHOWLOCALS \ 2025-12-04T09:51:38.6683638Z  -e NO_TEST_TIMEOUT \ 2025-12-04T09:51:38.6683825Z  -e NO_TD \ 2025-12-04T09:51:38.6683995Z  -e TD_DISTRIBUTED \ 2025-12-04T09:51:38.6684187Z  -e PR_LABELS \ 2025-12-04T09:51:38.6684404Z  -e MAX_JOBS="$(nproc --ignore=2)" \ 2025-12-04T09:51:38.6684627Z  -e SCCACHE_BUCKET \ 2025-12-04T09:51:38.6684813Z  -e SCCACHE_REGION \ 2025-12-04T09:51:38.6685000Z  -e XLA_CUDA \ 2025-12-04T09:51:38.6685192Z  -e XLA_CLANG_CACHE_S3_BUCKET_NAME \ 2025-12-04T09:51:38.6685454Z  -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK \ 2025-12-04T09:51:38.6685710Z  -e PYTORCH_TEST_RERUN_DISABLED_TESTS \ 2025-12-04T09:51:38.6685967Z  -e SKIP_SCCACHE_INITIALIZATION=1 \ 2025-12-04T09:51:38.6686195Z  -e HUGGING_FACE_HUB_TOKEN \ 2025-12-04T09:51:38.6686436Z  -e VLLM_TEST_HUGGING_FACE_TOKEN \ 2025-12-04T09:51:38.6686678Z  -e SCRIBE_GRAPHQL_ACCESS_TOKEN \ 2025-12-04T09:51:38.6686893Z  -e DASHBOARD_TAG \ 2025-12-04T09:51:38.6687090Z  -e ARTIFACTS_FILE_SUFFIX \ 2025-12-04T09:51:38.6687347Z  --memory="${TOTAL_AVAILABLE_MEMORY_IN_GB%.*}g" \ 2025-12-04T09:51:38.6687630Z  --memory-swap="${TOTAL_MEMORY_WITH_SWAP}g" \ 2025-12-04T09:51:38.6687928Z  
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \ 2025-12-04T09:51:38.6688201Z  --security-opt seccomp=unconfined \ 2025-12-04T09:51:38.6688433Z  --cap-add=SYS_PTRACE \ 2025-12-04T09:51:38.6688631Z  --ipc=host \ 2025-12-04T09:51:38.6688813Z  ${SHM_OPTS} \ 2025-12-04T09:51:38.6688989Z  --tty \ 2025-12-04T09:51:38.6689154Z  --detach \ 2025-12-04T09:51:38.6689345Z  --name="${container_name}" \ 2025-12-04T09:51:38.6689564Z  ${JENKINS_USER} \ 2025-12-04T09:51:38.6689819Z  -v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \ 2025-12-04T09:51:38.6690093Z  -w /var/lib/jenkins/workspace \ 2025-12-04T09:51:38.6690309Z  "${USED_IMAGE}" \ 2025-12-04T09:51:38.6690501Z  ${DOCKER_SHELL_CMD} 2025-12-04T09:51:38.6690690Z ) 2025-12-04T09:51:38.6690925Z echo "DOCKER_CONTAINER_ID=${container_name}" >> "${GITHUB_ENV}" 2025-12-04T09:51:38.6691210Z  2025-12-04T09:51:38.6691399Z if [[ ${BUILD_ENVIRONMENT} == *"s390x"* ]]; then 2025-12-04T09:51:38.6691883Z  docker exec -t "${container_name}" sh -c "python3 -m pip install -r .ci/docker/requirements-ci.txt" 2025-12-04T09:51:38.6692310Z fi 2025-12-04T09:51:38.6692484Z  2025-12-04T09:51:38.6692885Z docker exec -t "${container_name}" sh -c "python3 -m pip install $(echo dist/*.whl)[opt-einsum] && ${TEST_COMMAND}" 2025-12-04T09:51:38.6699991Z shell: /usr/bin/bash -e {0} 2025-12-04T09:51:38.6700196Z env: 2025-12-04T09:51:38.6700346Z GIT_DEFAULT_BRANCH: main 2025-12-04T09:51:38.6700529Z HAS_NVIDIA_GPU: true 2025-12-04T09:51:38.6700758Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:51:38.6701080Z BUILD_ENVIRONMENT: linux-jammy-cuda12.8-py3.10-gcc11-debug 2025-12-04T09:51:38.6701349Z PR_NUMBER: 2025-12-04T09:51:38.6701523Z GITHUB_REPOSITORY: pytorch/pytorch 2025-12-04T09:51:38.6701743Z GITHUB_WORKFLOW: periodic 2025-12-04T09:51:38.6701927Z GITHUB_JOB: test 2025-12-04T09:51:38.6702093Z GITHUB_RUN_ID: 19922826259 2025-12-04T09:51:38.6702275Z GITHUB_RUN_NUMBER: 19107 2025-12-04T09:51:38.6702450Z GITHUB_RUN_ATTEMPT: 1 
2025-12-04T09:51:38.6702616Z JOB_ID: 57120265563 2025-12-04T09:51:38.6703216Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 2025-12-04T09:51:38.6703745Z BRANCH: main 2025-12-04T09:51:38.6703949Z SHA1: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:51:38.6704302Z BASE_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:51:38.6704537Z TEST_CONFIG: default 2025-12-04T09:51:38.6704697Z SHARD_NUMBER: 1 2025-12-04T09:51:38.6704864Z NUM_TEST_SHARDS: 7 2025-12-04T09:51:38.6705025Z EXTRA_FLAGS: 2025-12-04T09:51:38.6705190Z OP_BENCHMARK_TESTS: 2025-12-04T09:51:38.6705368Z REENABLED_ISSUES: 2025-12-04T09:51:38.6705545Z CONTINUE_THROUGH_ERROR: True 2025-12-04T09:51:38.6705734Z VERBOSE_TEST_LOGS: False 2025-12-04T09:51:38.6705925Z TEST_SHOWLOCALS: False 2025-12-04T09:51:38.6706113Z NO_TEST_TIMEOUT: False 2025-12-04T09:51:38.6706279Z NO_TD: False 2025-12-04T09:51:38.6706438Z TD_DISTRIBUTED: False 2025-12-04T09:51:38.6706662Z SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2 2025-12-04T09:51:38.6706911Z SCCACHE_REGION: us-east-1 2025-12-04T09:51:38.6707095Z SHM_SIZE: 2g 2025-12-04T09:51:38.6707752Z DOCKER_IMAGE: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:51:38.6708737Z DOCKER_IMAGE_S390X: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:51:38.6709325Z XLA_CUDA: 2025-12-04T09:51:38.6709575Z XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:51:38.6709902Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK: 0 2025-12-04T09:51:38.6710120Z PYTORCH_TEST_RERUN_DISABLED_TESTS: 1 2025-12-04T09:51:38.6710350Z DASHBOARD_TAG: 2025-12-04T09:51:38.6710677Z VLLM_TEST_HUGGING_FACE_TOKEN: *** 2025-12-04T09:51:38.6710967Z 
HUGGING_FACE_HUB_TOKEN: *** 2025-12-04T09:51:38.6711254Z SCRIBE_GRAPHQL_ACCESS_TOKEN: *** 2025-12-04T09:51:38.6711623Z ARTIFACTS_FILE_SUFFIX: test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T09:51:38.6712012Z ##[endgroup] 2025-12-04T09:51:38.6737699Z + [[ default == \m\u\l\t\i\g\p\u ]] 2025-12-04T09:51:38.6738206Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *onnx* ]] 2025-12-04T09:51:38.6738564Z + TEST_COMMAND=.ci/pytorch/test.sh 2025-12-04T09:51:38.6741038Z ++ awk '/MemTotal/ { printf "%.3f \n", $2/1024/1024 - 1 }' /proc/meminfo 2025-12-04T09:51:38.6765088Z + TOTAL_AVAILABLE_MEMORY_IN_GB='59.453 ' 2025-12-04T09:51:38.6765587Z + TOTAL_MEMORY_WITH_SWAP=62 2025-12-04T09:51:38.6766003Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *\s\3\9\0\x* ]] 2025-12-04T09:51:38.6766360Z + SHM_OPTS=--shm-size=2g 2025-12-04T09:51:38.6766593Z + JENKINS_USER='--user jenkins' 2025-12-04T09:51:38.6766836Z + DOCKER_SHELL_CMD= 2025-12-04T09:51:38.6767536Z + USED_IMAGE=308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:51:38.6774184Z +++ nproc --ignore=2 2025-12-04T09:51:38.6808675Z ++ docker run --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all -e BUILD_ENVIRONMENT -e PR_NUMBER -e GITHUB_ACTIONS -e GITHUB_REPOSITORY -e GITHUB_WORKFLOW -e GITHUB_JOB -e GITHUB_RUN_ID -e GITHUB_RUN_NUMBER -e GITHUB_RUN_ATTEMPT -e JOB_ID -e JOB_NAME -e BASE_SHA -e BRANCH -e SHA1 -e AWS_DEFAULT_REGION -e IN_WHEEL_TEST -e SHARD_NUMBER -e TEST_CONFIG -e NUM_TEST_SHARDS -e REENABLED_ISSUES -e CONTINUE_THROUGH_ERROR -e VERBOSE_TEST_LOGS -e TEST_SHOWLOCALS -e NO_TEST_TIMEOUT -e NO_TD -e TD_DISTRIBUTED -e PR_LABELS -e MAX_JOBS=14 -e SCCACHE_BUCKET -e SCCACHE_REGION -e XLA_CUDA -e XLA_CLANG_CACHE_S3_BUCKET_NAME -e PYTORCH_TEST_CUDA_MEM_LEAK_CHECK -e PYTORCH_TEST_RERUN_DISABLED_TESTS -e SKIP_SCCACHE_INITIALIZATION=1 -e HUGGING_FACE_HUB_TOKEN -e VLLM_TEST_HUGGING_FACE_TOKEN 
-e SCRIBE_GRAPHQL_ACCESS_TOKEN -e DASHBOARD_TAG -e ARTIFACTS_FILE_SUFFIX --memory=59g --memory-swap=62g --env-file=/tmp/github_env_19922826259 --security-opt seccomp=unconfined --cap-add=SYS_PTRACE --ipc=host --shm-size=2g --tty --detach --name= --user jenkins -v /home/ec2-user/actions-runner/_work/pytorch/pytorch:/var/lib/jenkins/workspace -w /var/lib/jenkins/workspace 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T09:51:49.6531910Z + container_name=7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T09:51:49.6532631Z + echo DOCKER_CONTAINER_ID=7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T09:51:49.6534141Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *\s\3\9\0\x* ]] 2025-12-04T09:51:49.6539717Z ++ echo dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl 2025-12-04T09:51:49.6542192Z + docker exec -t 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 sh -c 'python3 -m pip install dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl[opt-einsum] && .ci/pytorch/test.sh' 2025-12-04T09:51:50.0919841Z Processing ./dist/torch-2.10.0a0+gitffd9b0f-cp310-cp310-linux_x86_64.whl (from torch==2.10.0a0+gitffd9b0f) 2025-12-04T09:51:50.3982971Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.18.0) 2025-12-04T09:51:50.3985642Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (4.12.2) 2025-12-04T09:51:50.3989339Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.13.3) 2025-12-04T09:51:50.3993040Z Requirement already satisfied: 
networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2.8.8) 2025-12-04T09:51:50.3996235Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.1.6) 2025-12-04T09:51:50.4000034Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (2025.10.0) 2025-12-04T09:51:50.4011333Z Requirement already satisfied: opt-einsum>=3.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.3.0) 2025-12-04T09:51:50.4330443Z Requirement already satisfied: numpy>=1.7 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from opt-einsum>=3.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.22.4) 2025-12-04T09:51:50.4347139Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (1.3.0) 2025-12-04T09:51:50.4397582Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch==2.10.0a0+gitffd9b0f->torch==2.10.0a0+gitffd9b0f) (3.0.3) 2025-12-04T09:51:50.7547374Z Installing collected packages: torch 2025-12-04T09:52:01.1656444Z Successfully installed torch-2.10.0a0+gitffd9b0f 2025-12-04T09:52:01.2324304Z + export TERM=vt100 2025-12-04T09:52:01.2324611Z + TERM=vt100 2025-12-04T09:52:01.2326367Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:52:01.2336559Z + source .ci/pytorch/common.sh 2025-12-04T09:52:01.2341123Z +++ dirname .ci/pytorch/common.sh 2025-12-04T09:52:01.2436732Z ++ source .ci/pytorch/common_utils.sh 2025-12-04T09:52:01.2437950Z +++ declare -f -t trap_add 2025-12-04T09:52:01.2442878Z ++ set -ex -o pipefail 
2025-12-04T09:52:01.2443143Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *rocm* ]] 2025-12-04T09:52:01.2496084Z ++ BUILD_TEST_LIBTORCH=0 2025-12-04T09:52:01.2496366Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:52:01.2504740Z + source .ci/pytorch/common-build.sh 2025-12-04T09:52:01.2506565Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *win-* ]] 2025-12-04T09:52:01.2512649Z ++++ dirname .ci/pytorch/common-build.sh 2025-12-04T09:52:01.2522406Z +++ cd .ci/pytorch 2025-12-04T09:52:01.2522857Z +++ pwd -P 2025-12-04T09:52:01.2525580Z ++ script_dir=/var/lib/jenkins/workspace/.ci/pytorch 2025-12-04T09:52:01.2525928Z ++ [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *-pch* ]] 2025-12-04T09:52:01.2526182Z ++ which sccache 2025-12-04T09:52:01.2559889Z ++ [[ -z ossci-compiler-cache-circleci-v2 ]] 2025-12-04T09:52:01.2560228Z ++ sccache --stop-server 2025-12-04T09:52:01.2589947Z ++ true 2025-12-04T09:52:01.2590158Z ++ rm -f /var/lib/jenkins/sccache_error.log 2025-12-04T09:52:01.2601967Z ++ trap_add sccache_epilogue EXIT 2025-12-04T09:52:01.2602213Z ++ trap_add_cmd=sccache_epilogue 2025-12-04T09:52:01.2602407Z ++ shift 2025-12-04T09:52:01.2602567Z ++ for trap_add_name in "$@" 2025-12-04T09:52:01.2607640Z ++++ trap -p EXIT 2025-12-04T09:52:01.2610933Z +++ eval 'extract_trap_cmd ' 2025-12-04T09:52:01.2611175Z ++++ extract_trap_cmd 2025-12-04T09:52:01.2611365Z ++++ printf '%s\n' '' 2025-12-04T09:52:01.2611590Z +++ printf '%s\n' sccache_epilogue 2025-12-04T09:52:01.2613422Z ++ trap -- ' 2025-12-04T09:52:01.2613708Z sccache_epilogue' EXIT 2025-12-04T09:52:01.2613955Z ++ [[ -n 1 ]] 2025-12-04T09:52:01.2615756Z ++ echo 'Skipping sccache server initialization, setting environment variables' 2025-12-04T09:52:01.2616459Z Skipping sccache server initialization, setting environment variables 2025-12-04T09:52:01.2616872Z ++ export SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:52:01.2617131Z ++ SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:52:01.2617441Z ++ export 
SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:52:01.2617844Z ++ SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:52:01.2625236Z ++ export RUST_LOG=sccache::server=error 2025-12-04T09:52:01.2625503Z ++ RUST_LOG=sccache::server=error 2025-12-04T09:52:01.2625731Z ++ sccache --zero-stats 2025-12-04T09:52:01.7278343Z Statistics zeroed. 2025-12-04T09:52:01.7285493Z ++ which ccache 2025-12-04T09:52:01.7347062Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *rocm* ]] 2025-12-04T09:52:01.7347538Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *s390x* ]] 2025-12-04T09:52:01.7347853Z + [[ -d /var/lib/jenkins/workspace ]] 2025-12-04T09:52:01.7350652Z ++ stat -c %u /var/lib/jenkins/workspace 2025-12-04T09:52:01.7369737Z + WORKSPACE_ORIGINAL_OWNER_ID=1000 2025-12-04T09:52:01.7369997Z + trap_add cleanup_workspace EXIT 2025-12-04T09:52:01.7370231Z + trap_add_cmd=cleanup_workspace 2025-12-04T09:52:01.7370431Z + shift 2025-12-04T09:52:01.7370584Z + for trap_add_name in "$@" 2025-12-04T09:52:01.7376715Z +++ trap -p EXIT 2025-12-04T09:52:01.7380087Z ++ eval 'extract_trap_cmd trap -- '\'' 2025-12-04T09:52:01.7380404Z sccache_epilogue'\'' EXIT' 2025-12-04T09:52:01.7380633Z +++ extract_trap_cmd trap -- ' 2025-12-04T09:52:01.7380833Z sccache_epilogue' EXIT 2025-12-04T09:52:01.7381028Z +++ printf '%s\n' ' 2025-12-04T09:52:01.7381199Z sccache_epilogue' 2025-12-04T09:52:01.7381375Z ++ printf '%s\n' cleanup_workspace 2025-12-04T09:52:01.7382891Z + trap -- ' 2025-12-04T09:52:01.7383184Z sccache_epilogue 2025-12-04T09:52:01.7383355Z cleanup_workspace' EXIT 2025-12-04T09:52:01.7393678Z + sudo chown -R jenkins /var/lib/jenkins/workspace 2025-12-04T09:52:02.7298773Z + git config --global --add safe.directory /var/lib/jenkins/workspace 2025-12-04T09:52:02.7321648Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *cuda* ]] 2025-12-04T09:52:02.7324943Z ++ python -c 'import os;import numba.cuda; print(os.path.dirname(numba.cuda.__file__))' 
2025-12-04T09:52:03.1334590Z + NUMBA_CUDA_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:52:03.1335256Z + '[' -n /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ']' 2025-12-04T09:52:03.1339950Z +++ realpath .ci/pytorch/test.sh 2025-12-04T09:52:03.1351373Z ++ dirname /var/lib/jenkins/workspace/.ci/pytorch/test.sh 2025-12-04T09:52:03.1360255Z + NUMBA_PATCH=/var/lib/jenkins/workspace/.ci/pytorch/numba-cuda-13.patch 2025-12-04T09:52:03.1361258Z + pushd /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:52:03.1361839Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda ~/workspace 2025-12-04T09:52:03.1362425Z + patch -p4 2025-12-04T09:52:03.1376534Z patching file cudadrv/driver.py 2025-12-04T09:52:03.1376968Z Hunk #1 succeeded at 357 (offset -8 lines). 2025-12-04T09:52:03.1453009Z + popd 2025-12-04T09:52:03.1453186Z ~/workspace 2025-12-04T09:52:03.1453473Z + echo 'Environment variables:' 2025-12-04T09:52:03.1453704Z Environment variables: 2025-12-04T09:52:03.1453874Z + env 2025-12-04T09:52:03.1463914Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:52:03.1464479Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:52:03.1465028Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3.10-gcc11-debug 2025-12-04T09:52:03.1465823Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:52:03.1466209Z HOSTNAME=7dec456c8d4c 2025-12-04T09:52:03.1466772Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.1467559Z GITHUB_ACTION=__run_3 2025-12-04T09:52:03.1467796Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:52:03.1468083Z GITHUB_RUN_NUMBER=19107 2025-12-04T09:52:03.1468310Z TEST_CONFIG=default 2025-12-04T09:52:03.1468537Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:52:03.1468836Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:52:03.1469115Z SCCACHE_IDLE_TIMEOUT=0 
2025-12-04T09:52:03.1469493Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:52:03.1469752Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:52:03.1469975Z GITHUB_REF_TYPE=branch 2025-12-04T09:52:03.1470176Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.1470412Z XLA_CUDA= 2025-12-04T09:52:03.1470571Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:52:03.1470863Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:52:03.1471134Z *** 2025-12-04T09:52:03.1471322Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:52:03.1471626Z GITHUB_ACTIONS=true 2025-12-04T09:52:03.1471808Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:52:03.1472045Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:52:03.1472436Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.1472712Z GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.1473078Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/periodic.yml@refs/heads/main 2025-12-04T09:52:03.1473418Z UCC_HOME=/usr 2025-12-04T09:52:03.1473582Z VERBOSE_TEST_LOGS=False 2025-12-04T09:52:03.1473763Z GITHUB_REF=refs/heads/main 2025-12-04T09:52:03.1473944Z SHARD_NUMBER=1 2025-12-04T09:52:03.1474106Z GITHUB_REF_PROTECTED=true 2025-12-04T09:52:03.1474289Z HOME=/var/lib/jenkins 2025-12-04T09:52:03.1474483Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:52:03.1474730Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:52:03.1474977Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:52:03.1475211Z USE_SYSTEM_NCCL=1 2025-12-04T09:52:03.1475372Z NUM_TEST_SHARDS=7 2025-12-04T09:52:03.1475526Z UCX_HOME=/usr 2025-12-04T09:52:03.1475931Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.1476709Z JOB_NAME=linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:52:03.1477453Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.1478022Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:52:03.1478361Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:52:03.1478544Z DASHBOARD_TAG= 2025-12-04T09:52:03.1478710Z GITHUB_RUN_ID=19922826259 2025-12-04T09:52:03.1478885Z INSTALLED_OPENBLAS= 2025-12-04T09:52:03.1479314Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.1479791Z GITHUB_ACTOR=huydhn 2025-12-04T09:52:03.1479944Z PR_NUMBER= 2025-12-04T09:52:03.1480301Z DESIRED_CUDA=12.8.1 2025-12-04T09:52:03.1480482Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:52:03.1480812Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:52:03.1481045Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:52:03.1481285Z TERM=vt100 2025-12-04T09:52:03.1481438Z INSTALLED_VISION=yes 2025-12-04T09:52:03.1481602Z BRANCH=main 2025-12-04T09:52:03.1481757Z SCCACHE_REGION=us-east-1 2025-12-04T09:52:03.1481946Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:52:03.1482136Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:52:03.1482315Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:52:03.1482686Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:52:03.1483088Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:52:03.1483349Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:52:03.1483589Z REENABLED_ISSUES= 2025-12-04T09:52:03.1483741Z DOCS= 2025-12-04T09:52:03.1483878Z SHLVL=1 2025-12-04T09:52:03.1484018Z MAX_JOBS=14 2025-12-04T09:52:03.1484165Z GITHUB_ACTOR_ID=475357 2025-12-04T09:52:03.1484407Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.1484681Z GITHUB_REF_NAME=main 2025-12-04T09:52:03.1484945Z 
XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:52:03.1485238Z GITHUB_JOB=test 2025-12-04T09:52:03.1485397Z NO_TEST_TIMEOUT=False 2025-12-04T09:52:03.1485566Z TD_DISTRIBUTED=False 2025-12-04T09:52:03.1485744Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:52:03.1485956Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:52:03.1486136Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:52:03.1486312Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:52:03.1486853Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:52:03.1487409Z GITHUB_BASE_REF= 2025-12-04T09:52:03.1487563Z INSTALLED_ACL= 2025-12-04T09:52:03.1487900Z ARTIFACTS_FILE_SUFFIX=test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T09:52:03.1488270Z CI=true 2025-12-04T09:52:03.1488433Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:52:03.1488668Z RUST_LOG=sccache::server=error 2025-12-04T09:52:03.1488860Z JOB_ID=57120265563 2025-12-04T09:52:03.1489037Z GITHUB_HEAD_REF= 2025-12-04T09:52:03.1489197Z GITHUB_ACTION_REF= 2025-12-04T09:52:03.1489401Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:52:03.1489662Z TEST_SHOWLOCALS=False 2025-12-04T09:52:03.1489836Z GITHUB_WORKFLOW=periodic 2025-12-04T09:52:03.1490025Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:52:03.1490465Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.1490899Z NO_TD=False 2025-12-04T09:52:03.1491064Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:52:03.1491275Z NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:52:03.1491485Z _=/usr/bin/env 2025-12-04T09:52:03.1491737Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:52:03.1492109Z ++ python -c 'import site; print(site.getsitepackages()[0])' 2025-12-04T09:52:03.1596831Z + 
TORCH_INSTALL_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch 2025-12-04T09:52:03.1597635Z + TORCH_BIN_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T09:52:03.1598178Z + TORCH_LIB_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib 2025-12-04T09:52:03.1598793Z + TORCH_TEST_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/test 2025-12-04T09:52:03.1599185Z + BUILD_DIR=build 2025-12-04T09:52:03.1599401Z + BUILD_RENAMED_DIR=build_renamed 2025-12-04T09:52:03.1599673Z + BUILD_BIN_DIR=build/bin 2025-12-04T09:52:03.1599894Z + SHARD_NUMBER=1 2025-12-04T09:52:03.1600092Z + NUM_TEST_SHARDS=7 2025-12-04T09:52:03.1600284Z + export TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:52:03.1600669Z + TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:52:03.1600866Z + export VALGRIND=ON 2025-12-04T09:52:03.1601251Z + VALGRIND=ON 2025-12-04T09:52:03.1601610Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *clang9* ]] 2025-12-04T09:52:03.1602080Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *xpu* ]] 2025-12-04T09:52:03.1602341Z + detect_cuda_arch 2025-12-04T09:52:03.1602553Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *cuda* ]] 2025-12-04T09:52:03.1602813Z + command -v nvidia-smi 2025-12-04T09:52:03.1602986Z /usr/bin/nvidia-smi 2025-12-04T09:52:03.1606732Z ++ nvidia-smi --query-gpu=compute_cap --format=csv 2025-12-04T09:52:03.1607787Z ++ tail -n 1 2025-12-04T09:52:03.1824080Z + TORCH_CUDA_ARCH_LIST=8.9 2025-12-04T09:52:03.1824383Z + export TORCH_CUDA_ARCH_LIST 2025-12-04T09:52:03.1824706Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *s390x* ]] 2025-12-04T09:52:03.1825036Z + [[ 1 == \1 ]] 2025-12-04T09:52:03.1825228Z + ulimit -c 0 2025-12-04T09:52:03.1825495Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *bazel* ]] 2025-12-04T09:52:03.1829108Z ++ realpath build/custom_test_artifacts 2025-12-04T09:52:03.2006220Z + CUSTOM_TEST_ARTIFACT_BUILD_DIR=/var/lib/jenkins/workspace/build/custom_test_artifacts 
2025-12-04T09:52:03.2006684Z + [[ -n '' ]] 2025-12-04T09:52:03.2006901Z + echo 'Environment variables' 2025-12-04T09:52:03.2007144Z Environment variables 2025-12-04T09:52:03.2007353Z + env 2025-12-04T09:52:03.2158464Z GITHUB_WORKSPACE=/home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T09:52:03.2159265Z CONTINUE_THROUGH_ERROR=True 2025-12-04T09:52:03.2160093Z BUILD_ENVIRONMENT=linux-jammy-cuda12.8-py3.10-gcc11-debug 2025-12-04T09:52:03.2160766Z VLLM_TEST_HUGGING_FACE_TOKEN=*** 2025-12-04T09:52:03.2161085Z HOSTNAME=7dec456c8d4c 2025-12-04T09:52:03.2161758Z GITHUB_PATH=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/add_path_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.2162237Z GITHUB_ACTION=__run_3 2025-12-04T09:52:03.2162428Z PYTORCH_TEST_CUDA_MEM_LEAK_CHECK=0 2025-12-04T09:52:03.2162645Z GITHUB_RUN_NUMBER=19107 2025-12-04T09:52:03.2162814Z TEST_CONFIG=default 2025-12-04T09:52:03.2163015Z GITHUB_REPOSITORY_OWNER_ID=21003710 2025-12-04T09:52:03.2163355Z TORCH_NVCC_FLAGS=-Xfatbin -compress-all 2025-12-04T09:52:03.2163609Z SCCACHE_IDLE_TIMEOUT=0 2025-12-04T09:52:03.2163995Z SCRIBE_GRAPHQL_ACCESS_TOKEN=*** 2025-12-04T09:52:03.2164208Z GITHUB_TRIGGERING_ACTOR=huydhn 2025-12-04T09:52:03.2164430Z GITHUB_REF_TYPE=branch 2025-12-04T09:52:03.2164700Z TORCH_CUDA_ARCH_LIST=8.9 2025-12-04T09:52:03.2164927Z BASE_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.2165161Z XLA_CUDA= 2025-12-04T09:52:03.2165317Z NCCL_LIB_DIR=/usr/local/cuda/lib64/ 2025-12-04T09:52:03.2165724Z HUGGING_FACE_HUB_TOKEN=*** 2025-12-04T09:52:03.2165960Z *** 2025-12-04T09:52:03.2166118Z GITHUB_REPOSITORY_ID=65600975 2025-12-04T09:52:03.2166312Z GITHUB_ACTIONS=true 2025-12-04T09:52:03.2166488Z NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T09:52:03.2166738Z SCCACHE_ERROR_LOG=/var/lib/jenkins/sccache_error.log 2025-12-04T09:52:03.2167014Z SHA1=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.2167281Z 
GITHUB_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.2167655Z GITHUB_WORKFLOW_REF=pytorch/pytorch/.github/workflows/periodic.yml@refs/heads/main 2025-12-04T09:52:03.2167994Z UCC_HOME=/usr 2025-12-04T09:52:03.2168153Z TORCH_SERIALIZATION_DEBUG=1 2025-12-04T09:52:03.2168463Z VERBOSE_TEST_LOGS=False 2025-12-04T09:52:03.2168784Z GITHUB_REF=refs/heads/main 2025-12-04T09:52:03.2169082Z SHARD_NUMBER=1 2025-12-04T09:52:03.2169334Z GITHUB_REF_PROTECTED=true 2025-12-04T09:52:03.2169614Z HOME=/var/lib/jenkins 2025-12-04T09:52:03.2169888Z GITHUB_API_URL=https://api.github.com 2025-12-04T09:52:03.2170302Z PYTORCH_TEST_RERUN_DISABLED_TESTS=1 2025-12-04T09:52:03.2170680Z UCX_COMMIT=7836b165abdbe468a2f607e7254011c07d788152 2025-12-04T09:52:03.2171021Z USE_SYSTEM_NCCL=1 2025-12-04T09:52:03.2171296Z NUM_TEST_SHARDS=7 2025-12-04T09:52:03.2171477Z UCX_HOME=/usr 2025-12-04T09:52:03.2171884Z GITHUB_STATE=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/save_state_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.2172854Z JOB_NAME=linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T09:52:03.2173744Z GITHUB_ENV=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_env_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.2174314Z GITHUB_EVENT_PATH=/home/ec2-user/actions-runner/_work/_temp/_github_workflow/event.json 2025-12-04T09:52:03.2174660Z GITHUB_EVENT_NAME=schedule 2025-12-04T09:52:03.2174845Z DASHBOARD_TAG= 2025-12-04T09:52:03.2175009Z GITHUB_RUN_ID=19922826259 2025-12-04T09:52:03.2175194Z INSTALLED_OPENBLAS= 2025-12-04T09:52:03.2175620Z GITHUB_STEP_SUMMARY=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/step_summary_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.2176088Z GITHUB_ACTOR=huydhn 2025-12-04T09:52:03.2176245Z PR_NUMBER= 2025-12-04T09:52:03.2176383Z DESIRED_CUDA=12.8.1 2025-12-04T09:52:03.2176547Z GITHUB_RUN_ATTEMPT=1 2025-12-04T09:52:03.2176718Z VALGRIND=ON 2025-12-04T09:52:03.2176871Z ANACONDA_PYTHON_VERSION=3.10 2025-12-04T09:52:03.2177112Z GITHUB_GRAPHQL_URL=https://api.github.com/graphql 2025-12-04T09:52:03.2177353Z TERM=vt100 2025-12-04T09:52:03.2177503Z INSTALLED_VISION=yes 2025-12-04T09:52:03.2177671Z BRANCH=main 2025-12-04T09:52:03.2177825Z SCCACHE_REGION=us-east-1 2025-12-04T09:52:03.2178014Z OPENSSL_ROOT_DIR=/opt/openssl 2025-12-04T09:52:03.2178205Z BUILD_AOT_INDUCTOR_TEST= 2025-12-04T09:52:03.2178387Z CUDA_PATH=/usr/local/cuda 2025-12-04T09:52:03.2178756Z GITHUB_ACTION_PATH=/home/ec2-user/actions-runner/_work/pytorch/pytorch/./.github/actions/setup-linux 2025-12-04T09:52:03.2179153Z GITHUB_SERVER_URL=https://github.com 2025-12-04T09:52:03.2179404Z UCC_COMMIT=430e241bf5d38cbc73fc7a6b89155397232e3f96 2025-12-04T09:52:03.2179641Z REENABLED_ISSUES= 2025-12-04T09:52:03.2179799Z DOCS= 2025-12-04T09:52:03.2179937Z SHLVL=1 2025-12-04T09:52:03.2180086Z MAX_JOBS=14 2025-12-04T09:52:03.2180230Z GITHUB_ACTOR_ID=475357 2025-12-04T09:52:03.2180471Z GITHUB_WORKFLOW_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T09:52:03.2180751Z GITHUB_REF_NAME=main 
2025-12-04T09:52:03.2181016Z XLA_CLANG_CACHE_S3_BUCKET_NAME=ossci-compiler-clang-cache-circleci-xla 2025-12-04T09:52:03.2181308Z GITHUB_JOB=test 2025-12-04T09:52:03.2181464Z NO_TEST_TIMEOUT=False 2025-12-04T09:52:03.2181630Z TD_DISTRIBUTED=False 2025-12-04T09:52:03.2181807Z GITHUB_REPOSITORY=pytorch/pytorch 2025-12-04T09:52:03.2182018Z GITHUB_RETENTION_DAYS=90 2025-12-04T09:52:03.2182201Z OPENSSL_DIR=/opt/openssl 2025-12-04T09:52:03.2182378Z GITHUB_ACTION_REPOSITORY= 2025-12-04T09:52:03.2182919Z PATH=/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:52:03.2183469Z GITHUB_BASE_REF= 2025-12-04T09:52:03.2183619Z INSTALLED_ACL= 2025-12-04T09:52:03.2183945Z ARTIFACTS_FILE_SUFFIX=test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T09:52:03.2184324Z CI=true 2025-12-04T09:52:03.2184489Z GITHUB_REPOSITORY_OWNER=pytorch 2025-12-04T09:52:03.2184722Z RUST_LOG=sccache::server=error 2025-12-04T09:52:03.2184913Z JOB_ID=57120265563 2025-12-04T09:52:03.2185069Z GITHUB_HEAD_REF= 2025-12-04T09:52:03.2185219Z GITHUB_ACTION_REF= 2025-12-04T09:52:03.2185419Z SCCACHE_BUCKET=ossci-compiler-cache-circleci-v2 2025-12-04T09:52:03.2185663Z TEST_SHOWLOCALS=False 2025-12-04T09:52:03.2185833Z GITHUB_WORKFLOW=periodic 2025-12-04T09:52:03.2186021Z DEBIAN_FRONTEND=noninteractive 2025-12-04T09:52:03.2186460Z GITHUB_OUTPUT=/home/ec2-user/actions-runner/_work/_temp/_runner_file_commands/set_output_e84e326d-7eba-4856-a6d5-381fb8e09f2f 2025-12-04T09:52:03.2186900Z NO_TD=False 2025-12-04T09:52:03.2187070Z SKIP_SCCACHE_INITIALIZATION=1 2025-12-04T09:52:03.2187412Z NCCL_INCLUDE_DIR=/usr/local/cuda/include/ 2025-12-04T09:52:03.2187721Z OLDPWD=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/numba/cuda 2025-12-04T09:52:03.2188012Z _=/usr/bin/env 2025-12-04T09:52:03.2188274Z + echo 'Testing pytorch' 2025-12-04T09:52:03.2188457Z Testing pytorch 
2025-12-04T09:52:03.2188731Z + export LANG=C.UTF-8 2025-12-04T09:52:03.2188996Z + LANG=C.UTF-8 2025-12-04T09:52:03.2189164Z + PR_NUMBER= 2025-12-04T09:52:03.2189320Z + [[ default == \d\e\f\a\u\l\t ]] 2025-12-04T09:52:03.2189524Z + export CUDA_VISIBLE_DEVICES=0 2025-12-04T09:52:03.2189725Z + CUDA_VISIBLE_DEVICES=0 2025-12-04T09:52:03.2189903Z + export HIP_VISIBLE_DEVICES=0 2025-12-04T09:52:03.2190096Z + HIP_VISIBLE_DEVICES=0 2025-12-04T09:52:03.2190281Z + [[ default == \d\i\s\t\r\i\b\u\t\e\d ]] 2025-12-04T09:52:03.2190487Z + [[ default == \s\l\o\w ]] 2025-12-04T09:52:03.2190756Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *slow-gradcheck* ]] 2025-12-04T09:52:03.2191097Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *cuda* ]] 2025-12-04T09:52:03.2191380Z + export PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:52:03.2191615Z + PYTORCH_TESTING_DEVICE_ONLY_FOR=cuda 2025-12-04T09:52:03.2191828Z + [[ default == *crossref* ]] 2025-12-04T09:52:03.2192072Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *rocm* ]] 2025-12-04T09:52:03.2192362Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *xpu* ]] 2025-12-04T09:52:03.2192684Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *-bazel-* ]] 2025-12-04T09:52:03.2192948Z + pip_install ninja==1.10.2 2025-12-04T09:52:03.2193193Z + pip_install_pkg='python3 -m pip install --progress-bar off' 2025-12-04T09:52:03.2193507Z + python3 -m pip install --progress-bar off ninja==1.10.2 2025-12-04T09:52:03.6965589Z Collecting ninja==1.10.2 2025-12-04T09:52:03.7178108Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl.metadata (5.0 kB) 2025-12-04T09:52:03.7419690Z Downloading ninja-1.10.2-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (108 kB) 2025-12-04T09:52:04.0794203Z Installing collected packages: ninja 2025-12-04T09:52:04.0795053Z Attempting uninstall: ninja 2025-12-04T09:52:04.0801690Z Found existing installation: ninja 1.11.1.4 2025-12-04T09:52:04.0823829Z Uninstalling ninja-1.11.1.4: 
2025-12-04T09:52:04.0949666Z Successfully uninstalled ninja-1.11.1.4 2025-12-04T09:52:04.1582580Z Successfully installed ninja-1.10.2 2025-12-04T09:52:04.2046151Z + export PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:52:04.2047794Z + PATH=/var/lib/jenkins/.local/bin:/opt/cache/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/opt/conda/envs/py_3.10/bin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin 2025-12-04T09:52:04.2048717Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *aarch64* ]] 2025-12-04T09:52:04.2049281Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *asan* ]] 2025-12-04T09:52:04.2049953Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *-debug* ]] 2025-12-04T09:52:04.2050750Z + echo 'We are in debug mode: linux-jammy-cuda12.8-py3.10-gcc11-debug. Expect the python assertion to fail' 2025-12-04T09:52:04.2051606Z We are in debug mode: linux-jammy-cuda12.8-py3.10-gcc11-debug. Expect the python assertion to fail 2025-12-04T09:52:04.2051982Z + cd test 2025-12-04T09:52:04.2052272Z + get_exit_code python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T09:52:04.2052597Z + set +e 2025-12-04T09:52:04.2052829Z + python -c 'import torch; torch._C._crash_if_debug_asserts_fail(424242)' 2025-12-04T09:52:05.4959455Z Traceback (most recent call last): 2025-12-04T09:52:05.4960626Z File "", line 1, in 2025-12-04T09:52:05.4963867Z RuntimeError: THPUtils_unpackInt(arg) != 424242 INTERNAL ASSERT FAILED at "/var/lib/jenkins/workspace/torch/csrc/Module.cpp":325, please report a bug to PyTorch. 
Expect anything but 424242 as an input for debug builds 2025-12-04T09:52:05.7313623Z + retcode=1 2025-12-04T09:52:05.7313991Z + set -e 2025-12-04T09:52:05.7314286Z + return 1 2025-12-04T09:52:05.7316315Z + [[ default == \n\o\g\p\u\_\N\O\_\A\V\X\2 ]] 2025-12-04T09:52:05.7317279Z + [[ default == \n\o\g\p\u\_\A\V\X\5\1\2 ]] 2025-12-04T09:52:05.7317720Z + [[ default == \l\e\g\a\c\y\_\n\v\i\d\i\a\_\d\r\i\v\e\r ]] 2025-12-04T09:52:05.7322389Z + DYNAMO_BENCHMARK_FLAGS=() 2025-12-04T09:52:05.7323043Z + [[ default == *pr_time_benchmarks* ]] 2025-12-04T09:52:05.7323423Z + [[ default == *dynamo_eager* ]] 2025-12-04T09:52:05.7323769Z + [[ default == *aot_eager* ]] 2025-12-04T09:52:05.7324087Z + [[ default == *aot_inductor* ]] 2025-12-04T09:52:05.7324440Z + [[ default == *max_autotune_inductor* ]] 2025-12-04T09:52:05.7324730Z + [[ default == *inductor* ]] 2025-12-04T09:52:05.7324924Z + [[ default == *dynamic* ]] 2025-12-04T09:52:05.7325125Z + [[ default == *cpu* ]] 2025-12-04T09:52:05.7325305Z + [[ default == *xpu* ]] 2025-12-04T09:52:05.7325527Z + DYNAMO_BENCHMARK_FLAGS+=(--device cuda) 2025-12-04T09:52:05.7468799Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *libtorch* ]] 2025-12-04T09:52:05.7469258Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *-bazel-* ]] 2025-12-04T09:52:05.7472115Z + cd test 2025-12-04T09:52:05.7472734Z + python -c 'import torch; print(torch.__config__.show())' 2025-12-04T09:52:07.2965757Z PyTorch built with: 2025-12-04T09:52:07.2966115Z - GCC 11.4 2025-12-04T09:52:07.2966325Z - C++ Version: 201703 2025-12-04T09:52:07.2966842Z - Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:52:07.2967498Z - Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:52:07.2967880Z - OpenMP 201511 (a.k.a. 
OpenMP 4.5) 2025-12-04T09:52:07.2968189Z - LAPACK is enabled (usually provided by MKL) 2025-12-04T09:52:07.2968478Z - NNPACK is enabled 2025-12-04T09:52:07.2968714Z - CPU capability usage: AVX2 2025-12-04T09:52:07.2968954Z - CUDA Runtime 12.8 2025-12-04T09:52:07.2969263Z - NVCC architecture flags: -gencode;arch=compute_89,code=sm_89 2025-12-04T09:52:07.2969632Z - CuDNN 91.0.2 (built against CUDA 12.9) 2025-12-04T09:52:07.2973285Z - Build settings: BLAS_INFO=mkl, BUILD_TYPE=RelWithAssert, COMMIT_SHA=ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32, CUDA_VERSION=12.8, CUDNN_VERSION=9.10.2, CXX_COMPILER=/opt/cache/bin/c++, CXX_FLAGS= -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Werror -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, FORCE_FALLBACK_CUDA_MPI=1, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, TORCH_VERSION=2.10.0, USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=ON, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=ON, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, USE_ROCM_KERNEL_ASSERT=OFF, USE_XCCL=OFF, USE_XPU=OFF, 2025-12-04T09:52:07.2976748Z 2025-12-04T09:52:07.5833223Z + cd test 2025-12-04T09:52:07.5833642Z + python -c 'import torch; print(torch.__config__.parallel_info())' 2025-12-04T09:52:08.8154027Z ATen/Parallel: 2025-12-04T09:52:08.8154449Z at::get_num_threads() : 8 2025-12-04T09:52:08.8154741Z at::get_num_interop_threads() : 8 2025-12-04T09:52:08.8155017Z OpenMP 
201511 (a.k.a. OpenMP 4.5) 2025-12-04T09:52:08.8155617Z omp_get_max_threads() : 8 2025-12-04T09:52:08.8156157Z Intel(R) oneAPI Math Kernel Library Version 2024.2-Product Build 20240605 for Intel(R) 64 architecture applications 2025-12-04T09:52:08.8156691Z mkl_get_max_threads() : 8 2025-12-04T09:52:08.8157032Z Intel(R) MKL-DNN v3.7.1 (Git Hash 8d263e693366ef8db40acc569cc7d8edf644556d) 2025-12-04T09:52:08.8157427Z std::thread::hardware_concurrency() : 16 2025-12-04T09:52:08.8157707Z Environment variables: 2025-12-04T09:52:08.8158343Z OMP_NUM_THREADS : [not set] 2025-12-04T09:52:08.8158611Z MKL_NUM_THREADS : [not set] 2025-12-04T09:52:08.8159055Z ATen parallel backend: OpenMP 2025-12-04T09:52:08.8159218Z 2025-12-04T09:52:09.0483327Z + [[ default == *numpy_2* ]] 2025-12-04T09:52:09.0483854Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *aarch64* ]] 2025-12-04T09:52:09.0484352Z + [[ default == *backward* ]] 2025-12-04T09:52:09.0484750Z + [[ default == *libtorch_agnostic_targetting* ]] 2025-12-04T09:52:09.0485128Z + [[ default == *xla* ]] 2025-12-04T09:52:09.0485424Z + [[ default == *vllm* ]] 2025-12-04T09:52:09.0485746Z + [[ default == *executorch* ]] 2025-12-04T09:52:09.0486087Z + [[ default == \j\i\t\_\l\e\g\a\c\y ]] 2025-12-04T09:52:09.0486454Z + [[ default == \q\u\a\n\t\i\z\a\t\i\o\n ]] 2025-12-04T09:52:09.0486936Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *libtorch* ]] 2025-12-04T09:52:09.0487397Z + [[ default == distributed ]] 2025-12-04T09:52:09.0487738Z + [[ default == *operator_benchmark* ]] 2025-12-04T09:52:09.0488150Z + [[ default == *operator_microbenchmark* ]] 2025-12-04T09:52:09.0488600Z + [[ default == *attention_microbenchmark* ]] 2025-12-04T09:52:09.0489016Z + [[ default == *inductor_distributed* ]] 2025-12-04T09:52:09.0489275Z + [[ default == *inductor-halide* ]] 2025-12-04T09:52:09.0489514Z + [[ default == *inductor-pallas* ]] 2025-12-04T09:52:09.0489734Z + [[ default == *inductor-triton-cpu* ]] 2025-12-04T09:52:09.0489971Z + [[ default == 
*inductor-micro-benchmark* ]] 2025-12-04T09:52:09.0490226Z + [[ default == *aoti_cross_compile_for_windows* ]] 2025-12-04T09:52:09.0490486Z + [[ default == *huggingface* ]] 2025-12-04T09:52:09.0490681Z + [[ default == *timm* ]] 2025-12-04T09:52:09.0490865Z + [[ default == cachebench ]] 2025-12-04T09:52:09.0491062Z + [[ default == verify_cachebench ]] 2025-12-04T09:52:09.0491261Z + [[ default == *torchbench* ]] 2025-12-04T09:52:09.0491469Z + [[ default == *inductor_cpp_wrapper* ]] 2025-12-04T09:52:09.0491692Z + [[ default == *inductor_core* ]] 2025-12-04T09:52:09.0491902Z + [[ default == *inductor* ]] 2025-12-04T09:52:09.0492095Z + [[ default == *einops* ]] 2025-12-04T09:52:09.0492308Z + [[ default == *dynamo_core* ]] 2025-12-04T09:52:09.0492507Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:52:09.0492764Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *rocm* ]] 2025-12-04T09:52:09.0493022Z + [[ 1 == 1 ]] 2025-12-04T09:52:09.0493179Z + [[ 7 -gt 1 ]] 2025-12-04T09:52:09.0493366Z + test_lazy_tensor_meta_reference_disabled 2025-12-04T09:52:09.0493660Z + export TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1 2025-12-04T09:52:09.0493968Z + TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE=1 2025-12-04T09:52:09.0494271Z + echo 'Testing lazy tensor operations without meta reference' 2025-12-04T09:52:09.0494601Z Testing lazy tensor operations without meta reference 2025-12-04T09:52:09.0494950Z + python test/run_test.py --include lazy/test_ts_opinfo.py --verbose 2025-12-04T09:52:13.3693628Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:52:13.4263742Z Ignoring disabled issues: [''] 2025-12-04T09:52:13.4341184Z Found test times from artifacts 2025-12-04T09:52:13.4659434Z Found test times from artifacts 2025-12-04T09:52:13.4669386Z Running all tests 2025-12-04T09:52:13.4672246Z Running parallel tests on 3 processes 2025-12-04T09:52:13.4672799Z Name: tests to run 
(est. time: 0.01min) 2025-12-04T09:52:13.4673079Z Serial tests (0): 2025-12-04T09:52:13.4673303Z Parallel tests (1): 2025-12-04T09:52:13.4673585Z lazy/test_ts_opinfo 1/1 2025-12-04T09:52:13.4673977Z Name: excluded (est. time: 0.0min) 2025-12-04T09:52:13.4674327Z Serial tests (0): 2025-12-04T09:52:13.4674542Z Parallel tests (0): 2025-12-04T09:52:13.4674980Z Running lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:52:13.467280][810.101756881] 2025-12-04T09:52:13.4675424Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:52:13.4679295Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_ts_opinfo.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:52:13.467642] 2025-12-04T09:52:17.4324926Z 2025-12-04T09:52:17.4326075Z lazy/test_ts_opinfo 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_ts_opinfo_1.1_4f78b575fb718f5e_.log 2025-12-04T09:52:17.4326803Z Running 0 items in this shard: 2025-12-04T09:52:17.4326983Z 2025-12-04T09:52:17.4327219Z Finished lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:52:17.432424][814.066898332], took 0.07min 2025-12-04T09:52:17.4333577Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-dfb44ef243b54b76.xml 2025-12-04T09:52:17.8359888Z Uploading artifacts took 0.12 seconds 2025-12-04T09:52:20.8770893Z Running lazy/test_ts_opinfo 1/1 ... 
[2025-12-04 09:52:20.876646][817.511119577] 2025-12-04T09:52:20.8771372Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:52:20.8774143Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'lazy/test_ts_opinfo.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:52:20.877083] 2025-12-04T09:52:25.0513344Z 2025-12-04T09:52:25.0514330Z lazy/test_ts_opinfo 1/1 was successful, full logs can be found in artifacts with path test/test-reports/lazy.test_ts_opinfo_1.1_b1841d9006e1882f_.log 2025-12-04T09:52:25.0514933Z Running 0 items in this shard: 2025-12-04T09:52:25.0515079Z 2025-12-04T09:52:25.0515286Z Finished lazy/test_ts_opinfo 1/1 ... [2025-12-04 09:52:25.051254][821.685727035], took 0.07min 2025-12-04T09:52:25.0527857Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-f86a1ea8b3ea1cce.xml 2025-12-04T09:52:25.9503183Z Running test batch 'tests to run' cost 12.48 seconds 2025-12-04T09:52:26.5227128Z 2025-12-04T09:52:26.5227587Z real 0m17.474s 2025-12-04T09:52:26.5227818Z user 0m23.577s 2025-12-04T09:52:26.5227978Z sys 0m9.871s 2025-12-04T09:52:26.5228235Z + export -n TORCH_DISABLE_FUNCTIONALIZATION_META_REFERENCE 2025-12-04T09:52:26.5228531Z + test_without_numpy 2025-12-04T09:52:26.5231414Z ++ dirname .ci/pytorch/test.sh 2025-12-04T09:52:26.5245744Z + pushd .ci/pytorch 2025-12-04T09:52:26.5246293Z ~/workspace/.ci/pytorch ~/workspace 2025-12-04T09:52:26.5246898Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');from unittest import TestCase;import torch;x=torch.randn(3,3);TestCase().assertRaises(RuntimeError, lambda: x.numpy())' 2025-12-04T09:52:27.2831924Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to 
initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2025-12-04T09:52:27.2833248Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:52:27.9051896Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');import torch;print(torch.tensor([torch.tensor(0.), torch.tensor(1.)]))' 2025-12-04T09:52:28.6689561Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 2025-12-04T09:52:28.6691398Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:52:29.0698815Z tensor([0., 1.]) 2025-12-04T09:52:29.2935534Z + [[ default == *dynamo_wrapped* ]] 2025-12-04T09:52:29.2935996Z + python -c 'import sys;sys.path.insert(0, '\''fake_numpy'\'');import torch; import torch.onnx' 2025-12-04T09:52:30.0552143Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/functional_tensor.py:283: UserWarning: Failed to initialize NumPy: Sorry PyTorch, but our NumPy is in the other folder (Triggered internally at /var/lib/jenkins/workspace/torch/csrc/utils/tensor_numpy.cpp:84.) 
2025-12-04T09:52:30.0553663Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T09:52:30.6947917Z + popd 2025-12-04T09:52:30.6948175Z ~/workspace 2025-12-04T09:52:30.6948368Z + install_torchvision 2025-12-04T09:52:30.6948609Z + local orig_preload 2025-12-04T09:52:30.6948820Z + local commit 2025-12-04T09:52:30.6951852Z ++ get_pinned_commit vision 2025-12-04T09:52:30.6952145Z ++ cat .github/ci_commit_pins/vision.txt 2025-12-04T09:52:30.6969403Z + commit=617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:30.6969750Z + orig_preload= 2025-12-04T09:52:30.6969952Z + '[' -n '' ']' 2025-12-04T09:52:30.6970255Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *cuda* ]] 2025-12-04T09:52:30.6970852Z + export FORCE_CUDA=1 2025-12-04T09:52:30.6971102Z + FORCE_CUDA=1 2025-12-04T09:52:30.6971300Z + export WITH_CUDA=1 2025-12-04T09:52:30.6971519Z + WITH_CUDA=1 2025-12-04T09:52:30.6972020Z + pip_build_and_install git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e dist/vision 2025-12-04T09:52:30.6972829Z + local build_target=git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:30.6973323Z + local wheel_dir=dist/vision 2025-12-04T09:52:30.6973557Z + local found_whl=0 2025-12-04T09:52:30.6973767Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:52:30.6974027Z + [[ -f dist/vision/*.whl ]] 2025-12-04T09:52:30.6974254Z + '[' 0 == 0 ']' 2025-12-04T09:52:30.6974849Z + python3 -m pip wheel --no-build-isolation --no-deps -w dist/vision git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:30.9869799Z Collecting git+https://github.com/pytorch/vision.git@617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:30.9873666Z Cloning https://github.com/pytorch/vision.git (to revision 617079d944b0e72632311c30ae2bbdf1168b901e) to /tmp/pip-req-build-d7sp5bm8 2025-12-04T09:52:31.0052039Z Running command git clone --filter=blob:none --quiet 
https://github.com/pytorch/vision.git /tmp/pip-req-build-d7sp5bm8 2025-12-04T09:52:32.4439112Z Running command git rev-parse -q --verify 'sha^617079d944b0e72632311c30ae2bbdf1168b901e' 2025-12-04T09:52:32.4466361Z Running command git fetch -q https://github.com/pytorch/vision.git 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:32.5519983Z Resolved https://github.com/pytorch/vision.git to commit 617079d944b0e72632311c30ae2bbdf1168b901e 2025-12-04T09:52:34.4755157Z Preparing metadata (pyproject.toml) ... done 2025-12-04T09:52:34.4789451Z Building wheels for collected packages: torchvision 2025-12-04T09:53:48.8890934Z Building wheel for torchvision (pyproject.toml) ... done 2025-12-04T09:53:48.8920278Z Created wheel for torchvision: filename=torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl size=1786341 sha256=c77da6b7a69ad475a943754a786d1b1d691cbe613200be47317090798ec0d66c 2025-12-04T09:53:48.8921588Z Stored in directory: /var/lib/jenkins/.cache/pip/wheels/12/b2/29/1f82685c5b5173629e1f36a9b93989ce92ce563e5fb91d27ac 2025-12-04T09:53:48.8957555Z Successfully built torchvision 2025-12-04T09:53:48.9886066Z + for file in "${wheel_dir}"/*.whl 2025-12-04T09:53:48.9886632Z + pip_install_whl dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:53:48.9887248Z + args=('dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl') 2025-12-04T09:53:48.9887681Z + local args 2025-12-04T09:53:48.9888043Z + [[ dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl == *\ * ]] 2025-12-04T09:53:48.9888480Z + for path in "${args[@]}" 2025-12-04T09:53:48.9889304Z + echo 'Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl' 2025-12-04T09:53:48.9890143Z Installing dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:53:48.9890863Z + python3 -mpip install
--no-index --no-deps dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:53:49.2883888Z Processing ./dist/vision/torchvision-0.25.0a0+617079d-cp310-cp310-linux_x86_64.whl 2025-12-04T09:53:49.2966078Z Installing collected packages: torchvision 2025-12-04T09:53:49.7214938Z Successfully installed torchvision-0.25.0a0+617079d 2025-12-04T09:53:49.7502163Z + '[' -n '' ']' 2025-12-04T09:53:49.7502465Z + test_python_shard 1 2025-12-04T09:53:49.7502703Z + [[ -z 7 ]] 2025-12-04T09:53:49.7503424Z + python test/run_test.py --exclude-jit-executor --exclude-distributed-tests --exclude-quantization-tests --shard 1 7 --verbose --upload-artifacts-while-running 2025-12-04T09:53:54.1612529Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T09:53:54.1691351Z Found test times from artifacts 2025-12-04T09:53:54.2013673Z Found test times from artifacts 2025-12-04T09:53:54.2022412Z Running all tests 2025-12-04T09:53:54.2516281Z Running parallel tests on 3 processes 2025-12-04T09:53:54.2519850Z Name: tests to run (est. 
time: 71.03min) 2025-12-04T09:53:54.2520149Z Serial tests (51): 2025-12-04T09:53:54.2520420Z inductor/test_flex_attention 1/6 2025-12-04T09:53:54.2520697Z inductor/test_flex_attention 3/6 2025-12-04T09:53:54.2520957Z inductor/test_flex_attention 4/6 2025-12-04T09:53:54.2521208Z inductor/test_flex_attention 5/6 2025-12-04T09:53:54.2521464Z inductor/test_flex_attention 6/6 2025-12-04T09:53:54.2521737Z test_privateuseone_python_backend 1/1 2025-12-04T09:53:54.2522022Z test_ci_sanity_check_fail 1/1 2025-12-04T09:53:54.2522261Z test_overrides 1/1 2025-12-04T09:53:54.2522507Z inductor/test_max_autotune 1/1 2025-12-04T09:53:54.2522754Z doctests 1/1 2025-12-04T09:53:54.2522977Z inductor/test_cutlass_backend 1/1 2025-12-04T09:53:54.2523250Z inductor/test_benchmark_fusion 1/1 2025-12-04T09:53:54.2523531Z inductor/test_distributed_patterns 1/1 2025-12-04T09:53:54.2523806Z dynamo/test_fake_distributed 1/1 2025-12-04T09:53:54.2524074Z test_sort_and_select 1/1 2025-12-04T09:53:54.2524327Z test_cpp_api_parity 1/1 2025-12-04T09:53:54.2524516Z test_extension_utils 1/1 2025-12-04T09:53:54.2524704Z test_show_pickle 1/1 2025-12-04T09:53:54.2524879Z test_torch 1/1 2025-12-04T09:53:54.2525042Z test_tensorexpr 1/1 2025-12-04T09:53:54.2525213Z test_utils 1/1 2025-12-04T09:53:54.2525385Z test_namedtuple_return_api 1/1 2025-12-04T09:53:54.2525588Z test_fake_tensor 1/1 2025-12-04T09:53:54.2525771Z test_multiprocessing 1/1 2025-12-04T09:53:54.2525954Z test_fx 1/1 2025-12-04T09:53:54.2526120Z test_autograd_fallback 1/1 2025-12-04T09:53:54.2526308Z test_autocast 1/1 2025-12-04T09:53:54.2526485Z test_python_dispatch 1/1 2025-12-04T09:53:54.2526678Z test_jit_disabled 1/1 2025-12-04T09:53:54.2526874Z test_cpp_extensions_mtia_backend 1/1 2025-12-04T09:53:54.2527127Z functorch/test_memory_efficient_fusion 1/1 2025-12-04T09:53:54.2527376Z test_tensor_creation_ops 1/1 2025-12-04T09:53:54.2527592Z test_cpp_extensions_stream_and_event 1/1 2025-12-04T09:53:54.2527812Z test_dispatch 1/1 
2025-12-04T09:53:54.2527988Z nn/test_convolution 1/1 2025-12-04T09:53:54.2528187Z test_cpp_extensions_jit 1/1 2025-12-04T09:53:54.2528387Z test_nn 1/1 2025-12-04T09:53:54.2528682Z test_multiprocessing_spawn 1/1 2025-12-04T09:53:54.2528885Z nn/test_pooling 1/1 2025-12-04T09:53:54.2529060Z test_cuda_trace 1/1 2025-12-04T09:53:54.2529242Z test_native_mha 1/1 2025-12-04T09:53:54.2529414Z test_cuda_nvml_based_avail 1/1 2025-12-04T09:53:54.2529621Z test_mobile_optimizer 1/1 2025-12-04T09:53:54.2530074Z test_cuda_primary_ctx 1/1 2025-12-04T09:53:54.2530294Z test_reductions 1/1 2025-12-04T09:53:54.2530604Z test_spectral_ops 1/1 2025-12-04T09:53:54.2530811Z distributions/test_distributions 1/1 2025-12-04T09:53:54.2531034Z test_autoload_disable 1/1 2025-12-04T09:53:54.2531219Z test_autoload_enable 1/1 2025-12-04T09:53:54.2531413Z test_cpp_extensions_aot_ninja 1/1 2025-12-04T09:53:54.2531637Z test_cpp_extensions_aot_no_ninja 1/1 2025-12-04T09:53:54.2531843Z Parallel tests (15): 2025-12-04T09:53:54.2532037Z inductor/test_collective_autotuning 1/1 2025-12-04T09:53:54.2532261Z inductor/test_halide 1/1 2025-12-04T09:53:54.2532450Z inductor/test_aot_inductor_utils 1/1 2025-12-04T09:53:54.2532678Z dynamo/test_graph_region_tracker 1/1 2025-12-04T09:53:54.2532895Z dynamo/test_unittest 1/1 2025-12-04T09:53:54.2533081Z inductor/test_compile 1/1 2025-12-04T09:53:54.2533285Z dynamo/test_functions 1/1 2025-12-04T09:53:54.2533481Z inductor/test_ordered_set 1/1 2025-12-04T09:53:54.2533696Z dynamo/test_install_free_tensors 1/1 2025-12-04T09:53:54.2533960Z inductor/test_torchinductor_codegen_config_overrides 1/1 2025-12-04T09:53:54.2534239Z export/test_passes 1/1 2025-12-04T09:53:54.2534434Z dynamo/test_autograd_function 1/1 2025-12-04T09:53:54.2534641Z inductor/test_codecache 1/1 2025-12-04T09:53:54.2534864Z complex_tensor/test_complex_tensor 2/3 2025-12-04T09:53:54.2535090Z optim/test_lrscheduler 1/1 2025-12-04T09:53:54.2535284Z Name: excluded (est. 
time: 0.0min) 2025-12-04T09:53:54.2535484Z Serial tests (0): 2025-12-04T09:53:54.2535652Z Parallel tests (0): 2025-12-04T09:53:54.2535953Z Running inductor/test_flex_attention 1/6 ... [2025-12-04 09:53:54.252491][910.886970414] 2025-12-04T09:53:54.2536303Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T09:53:54.2537190Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=1', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 09:53:54.252835] 2025-12-04T10:01:21.7698756Z 2025-12-04T10:01:21.7699452Z PRINTING LOG FILE of inductor/test_flex_attention 1/6 (test/test-reports/inductor.test_flex_attention_1.6_ddac0a72250f3643_.log) 2025-12-04T10:01:21.7757296Z Test results will be stored in test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8842d0c0a55c3e44.xml 2025-12-04T10:01:21.7758310Z ============================= test session starts ============================== 2025-12-04T10:01:21.7758994Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:01:21.7759599Z cachedir: .pytest_cache 2025-12-04T10:01:21.7760324Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:01:21.7761121Z rootdir: /var/lib/jenkins/workspace 2025-12-04T10:01:21.7761521Z configfile: pytest.ini 2025-12-04T10:01:21.7762261Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:01:21.7763075Z collecting ... 
collected 763 items 2025-12-04T10:01:21.7763494Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:01:21.7789773Z Running 50 items in this shard: test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, 
test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, 
test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda, test/inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda 2025-12-04T10:01:21.7814575Z 2025-12-04T10:01:21.7815090Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda PASSED [11.3040s] [ 2%] 2025-12-04T10:01:21.7816225Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5319s] [ 2%] 2025-12-04T10:01:21.7817323Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.4669s] [ 2%] 2025-12-04T10:01:21.7818405Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.3721s] [ 2%] 2025-12-04T10:01:21.7819486Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5757s] [ 2%] 2025-12-04T10:01:21.7820584Z 
inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.4485s] [ 2%] 2025-12-04T10:01:21.7821659Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5375s] [ 2%] 2025-12-04T10:01:21.7822740Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6499s] [ 2%] 2025-12-04T10:01:21.7823809Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5761s] [ 2%] 2025-12-04T10:01:21.7824879Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5494s] [ 2%] 2025-12-04T10:01:21.7825949Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5124s] [ 2%] 2025-12-04T10:01:21.7827016Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7232s] [ 2%] 2025-12-04T10:01:21.7828173Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6751s] [ 2%] 2025-12-04T10:01:21.7829256Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5914s] [ 2%] 2025-12-04T10:01:21.7830351Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6361s] [ 2%] 2025-12-04T10:01:21.7831421Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9408s] [ 2%] 2025-12-04T10:01:21.7832509Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.3559s] [ 2%] 2025-12-04T10:01:21.7833595Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5935s] [ 2%] 2025-12-04T10:01:21.7834680Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6195s] [ 2%] 
2025-12-04T10:01:21.7835732Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5897s] [ 2%] 2025-12-04T10:01:21.7836824Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8642s] [ 2%] 2025-12-04T10:01:21.7837937Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9276s] [ 2%] 2025-12-04T10:01:21.7839018Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7248s] [ 2%] 2025-12-04T10:01:21.7840189Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6091s] [ 2%] 2025-12-04T10:01:21.7841273Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8935s] [ 2%] 2025-12-04T10:01:21.7842348Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7682s] [ 2%] 2025-12-04T10:01:21.7843419Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6099s] [ 2%] 2025-12-04T10:01:21.7844611Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7270s] [ 2%] 2025-12-04T10:01:21.7845793Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8851s] [ 2%] 2025-12-04T10:01:21.7846872Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5053s] [ 2%] 2025-12-04T10:01:21.7847958Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8934s] [ 2%] 2025-12-04T10:01:21.7849028Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9631s] [ 2%] 2025-12-04T10:01:21.7850096Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7855s] [ 
2%] 2025-12-04T10:01:21.7851181Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.6276s] [ 2%] 2025-12-04T10:01:21.7852260Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8305s] [ 2%] 2025-12-04T10:01:21.7853327Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8622s] [ 2%] 2025-12-04T10:01:21.7854395Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5916s] [ 2%] 2025-12-04T10:01:21.7855775Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5865s] [ 2%] 2025-12-04T10:01:21.7856864Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.0100s] [ 2%] 2025-12-04T10:01:21.7857909Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9863s] [ 2%] 2025-12-04T10:01:21.7858996Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5117s] [ 2%] 2025-12-04T10:01:21.7860079Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8083s] [ 2%] 2025-12-04T10:01:21.7861174Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7107s] [ 2%] 2025-12-04T10:01:21.7862248Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7130s] [ 2%] 2025-12-04T10:01:21.7863332Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7673s] [ 2%] 2025-12-04T10:01:21.7864408Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.9727s] [ 2%] 2025-12-04T10:01:21.7865472Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.8543s] 
[ 2%] 2025-12-04T10:01:21.7866541Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [9.1040s] [ 2%] 2025-12-04T10:01:21.7867699Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.7166s] [ 2%] 2025-12-04T10:01:21.7868782Z inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda FAILED [8.5727s] [ 2%] 2025-12-04T10:01:21.7869405Z 2025-12-04T10:01:21.7869567Z =================================== FAILURES =================================== 2025-12-04T10:01:21.7870148Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:21.7870709Z Traceback (most recent call last): 2025-12-04T10:01:21.7871447Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:21.7872176Z self.assertTrue( 2025-12-04T10:01:21.7872675Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:21.7873258Z raise self.failureException(msg) 2025-12-04T10:01:21.7874079Z AssertionError: False is not true : Log file /tmp/tmpb0gwypls/flex_attention_configs.json was not created 2025-12-04T10:01:21.7874645Z 2025-12-04T10:01:21.7874874Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:21.7875822Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:21.7876403Z 2025-12-04T10:01:21.7876668Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:21.7877292Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.7877757Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.7878077Z unimplemented [] 2025-12-04T10:01:21.7878413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.7880812Z inductor 
[('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.7883309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.7883856Z graph_break [] 2025-12-04T10:01:21.7884224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.7886261Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.7888194Z current_size = base.storage().size() 2025-12-04T10:01:21.7888549Z Autotune Choices Stats: 2025-12-04T10:01:21.7891090Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.7893952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.7894850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.7895854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.7898431Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7902545Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7906547Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.7910539Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7914440Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.7918355Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7920783Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.7921448Z Autotune Choices Stats: 2025-12-04T10:01:21.7924066Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.7927335Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.7928785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.7930422Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.7933597Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7937734Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.7941815Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.7945847Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.7949960Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7953993Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.7958208Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.7962419Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.7966437Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7970637Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.7973155Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.7973939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.7974410Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.7974731Z unimplemented [] 2025-12-04T10:01:21.7975069Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.7975671Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.7978150Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.7980399Z graph_break [] 2025-12-04T10:01:21.7980768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.7981224Z Autotune Choices Stats: 2025-12-04T10:01:21.7983762Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.7986638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.7987617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.7988647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.7991237Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7995286Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.7999308Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8003245Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8007151Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8011069Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8013501Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:21.8014163Z Autotune Choices Stats: 2025-12-04T10:01:21.8016806Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8020074Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8021521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8023256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8026422Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8030522Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8034571Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8038624Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8042646Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8046654Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8050681Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8054815Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8059240Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8063300Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8065803Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:21.8066660Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:21.8067275Z Traceback (most recent call last): 2025-12-04T10:01:21.8068016Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:21.8068754Z self.assertTrue( 2025-12-04T10:01:21.8069249Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:21.8069836Z raise self.failureException(msg) 2025-12-04T10:01:21.8070488Z AssertionError: False is not true : Log file /tmp/tmp8wnpqbue/flex_attention_configs.json was not created 2025-12-04T10:01:21.8071025Z 2025-12-04T10:01:21.8071243Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:21.8072028Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:21.8072602Z 2025-12-04T10:01:21.8072876Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:21.8073501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8073964Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8074291Z unimplemented [] 2025-12-04T10:01:21.8074626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8077009Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.8079503Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8080067Z graph_break [] 2025-12-04T10:01:21.8080432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8082621Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.8084667Z current_size = base.storage().size() 2025-12-04T10:01:21.8085030Z Autotune Choices Stats: 2025-12-04T10:01:21.8087566Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.8090480Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8091393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8092411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8094966Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8098865Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8102766Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8106667Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8110629Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8114666Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8117205Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.8117858Z Autotune Choices Stats: 2025-12-04T10:01:21.8120489Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.8123764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8125206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8126830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8129893Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8133946Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8138008Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8142155Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8146236Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8150296Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8154308Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8158474Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8162532Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8166535Z 
triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8169028Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.8169801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8170271Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8170585Z unimplemented [] 2025-12-04T10:01:21.8170917Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8171524Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8174189Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8176582Z graph_break [] 2025-12-04T10:01:21.8176967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8177433Z Autotune Choices Stats: 2025-12-04T10:01:21.8179987Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8182883Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8183777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8184808Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8187450Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8191392Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:21.8195291Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8199203Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8203202Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8207257Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8209713Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:21.8210369Z Autotune Choices Stats: 2025-12-04T10:01:21.8260018Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8263418Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8264900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8266544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8269697Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8273750Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8277805Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8282088Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8286307Z triton_flex_attention_backward_30 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8290368Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8294415Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8298444Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8302458Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8306480Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8309090Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:21.8309877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8310349Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8310777Z unimplemented [] 2025-12-04T10:01:21.8311112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8311814Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8314327Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), 
('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8316610Z graph_break [] 2025-12-04T10:01:21.8316985Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8317446Z Autotune Choices Stats: 2025-12-04T10:01:21.8320022Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8322920Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8323842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8324869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8327422Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8331349Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8335265Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8339303Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8343319Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8347300Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8349756Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:21.8350407Z Autotune Choices Stats: 2025-12-04T10:01:21.8353031Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8356522Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8357974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8359643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8362730Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8366800Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8374256Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8378460Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8382511Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8386574Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8390688Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8394745Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8398826Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.8402992Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8405525Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling 
for 13 choices 2025-12-04T10:01:21.8406470Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:21.8407020Z Traceback (most recent call last): 2025-12-04T10:01:21.8407744Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:21.8408483Z self.assertTrue( 2025-12-04T10:01:21.8408979Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:21.8409549Z raise self.failureException(msg) 2025-12-04T10:01:21.8410194Z AssertionError: False is not true : Log file /tmp/tmp_npbvc1j/flex_attention_configs.json was not created 2025-12-04T10:01:21.8410744Z 2025-12-04T10:01:21.8410959Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:21.8411740Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:21.8412311Z 2025-12-04T10:01:21.8412592Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:21.8413194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8413671Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8413980Z unimplemented [] 2025-12-04T10:01:21.8414308Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8416697Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.8419172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8419729Z graph_break [] 2025-12-04T10:01:21.8420096Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8422124Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.8424061Z current_size = base.storage().size() 2025-12-04T10:01:21.8424399Z Autotune Choices Stats: 2025-12-04T10:01:21.8426952Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.8429950Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8430850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8431880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8434585Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8438636Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8442528Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8446461Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8450374Z 
triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8454314Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8456898Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.8457561Z Autotune Choices Stats: 2025-12-04T10:01:21.8460202Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.8463663Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 
1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8465108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8466919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8470046Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8474139Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8478221Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8482296Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8486350Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8490444Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8494595Z triton_flex_attention_backward_12 0.0164 ms 
81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8498763Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8502903Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8507415Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8510175Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.8511170Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8511737Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8512274Z unimplemented [] 2025-12-04T10:01:21.8512714Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8513568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8516196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8518681Z graph_break [] 2025-12-04T10:01:21.8519248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8519944Z Autotune Choices Stats: 2025-12-04T10:01:21.8522745Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8525874Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8527001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8528345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8531218Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8535460Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8539638Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8543728Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8547912Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8551982Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8554645Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:21.8555613Z Autotune Choices Stats: 2025-12-04T10:01:21.8558531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8562262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8563904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8565778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8569008Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8573378Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8577616Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8582018Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8586365Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8590839Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8595252Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8599447Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8603787Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8608230Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8610964Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:21.8611925Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8612646Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8613197Z unimplemented [] 2025-12-04T10:01:21.8613609Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8614404Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8617090Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8619598Z graph_break [] 2025-12-04T10:01:21.8620101Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:21.8620648Z Autotune Choices Stats: 2025-12-04T10:01:21.8623499Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8626624Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8627828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8629193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8631937Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8636028Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8640143Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8644226Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8648374Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8652473Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8655343Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:21.8656217Z Autotune Choices Stats: 2025-12-04T10:01:21.8659079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8662932Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8664548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8666342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8669821Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8674221Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8678443Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8682692Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8686952Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8691456Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8695901Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8700198Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8704506Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.8708837Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8711617Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:21.8712620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8713180Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8713733Z unimplemented [] 2025-12-04T10:01:21.8714215Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8714961Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:21.8717695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8720314Z graph_break [] 2025-12-04T10:01:21.8720817Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8721452Z Autotune Choices Stats: 2025-12-04T10:01:21.8724235Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8727538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8728699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8779835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8782806Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8787069Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8791468Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8795646Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8799916Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8804303Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8807177Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:21.8807981Z Autotune Choices Stats: 2025-12-04T10:01:21.8810760Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.8814367Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8816059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8817929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8821318Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8825593Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8829934Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8834275Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8839590Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8843879Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8848254Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8852476Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8856880Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8861104Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:21.8863849Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices
2025-12-04T10:01:21.8864839Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:21.8865637Z Traceback (most recent call last):
2025-12-04T10:01:21.8866544Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:21.8867482Z self.assertTrue(
2025-12-04T10:01:21.8868177Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:21.8868988Z raise self.failureException(msg)
2025-12-04T10:01:21.8869915Z AssertionError: False is not true : Log file /tmp/tmpk9u0gh1c/flex_attention_configs.json was not created
2025-12-04T10:01:21.8870579Z
2025-12-04T10:01:21.8870844Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:21.8872217Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:21.8872908Z
2025-12-04T10:01:21.8873439Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:21.8874210Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:21.8874842Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:21.8875302Z unimplemented []
2025-12-04T10:01:21.8875880Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:21.8878392Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3),
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.8881125Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8881784Z graph_break [] 2025-12-04T10:01:21.8882416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8884671Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.8886849Z current_size = base.storage().size() 2025-12-04T10:01:21.8887361Z Autotune Choices Stats: 2025-12-04T10:01:21.8890126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.8893167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8894300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8895559Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8898244Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8902327Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8906605Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.8910899Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8915024Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.8919253Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8921872Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.8922788Z Autotune Choices Stats: 2025-12-04T10:01:21.8925509Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.8929032Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8930569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8932414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8935847Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8940118Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8944359Z 
triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8948645Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8952876Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8957328Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8961567Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.8965781Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.8970097Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.8974572Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.8977270Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.8978290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.8978856Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.8979390Z unimplemented [] 2025-12-04T10:01:21.8979880Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.8980602Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.8983242Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.8985678Z graph_break [] 2025-12-04T10:01:21.8986299Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.8986995Z Autotune Choices Stats: 2025-12-04T10:01:21.8989779Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.8992944Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.8993992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.8995430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.8998307Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9002677Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9007069Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9011270Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9015440Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9019641Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9022261Z SingleProcess AUTOTUNE benchmarking takes 0.2912 
seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:21.9023157Z Autotune Choices Stats: 2025-12-04T10:01:21.9025960Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9029585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9031292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9033101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9036467Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9040966Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9045191Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9049479Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9053738Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9058254Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9062653Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9067100Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9071678Z triton_flex_attention_backward_33 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9076011Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9078713Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:21.9079696Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9080269Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9080942Z unimplemented [] 2025-12-04T10:01:21.9081408Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9082157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9084780Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9087286Z graph_break [] 2025-12-04T10:01:21.9087872Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9088563Z Autotune Choices Stats: 2025-12-04T10:01:21.9091272Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9094377Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9095372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9096511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9099516Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9103785Z 
triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9108059Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9112120Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9116246Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9120372Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9122919Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:21.9123664Z Autotune Choices Stats: 2025-12-04T10:01:21.9126403Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9129947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9131725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9133554Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9136969Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9141261Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9145578Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9150054Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9154287Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9158790Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9163037Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9167556Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9171997Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9176331Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9179021Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:21.9179972Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9180582Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:21.9181129Z unimplemented [] 2025-12-04T10:01:21.9181530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9182382Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9185039Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9187585Z graph_break [] 2025-12-04T10:01:21.9188143Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9188784Z Autotune Choices Stats: 2025-12-04T10:01:21.9191475Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9194491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9195583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9196730Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9199323Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9203079Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9207122Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9211085Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9214852Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9218990Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9221635Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:21.9222547Z Autotune Choices Stats: 2025-12-04T10:01:21.9225385Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9229147Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9230955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9232829Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9236161Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9240624Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:21.9245027Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9249351Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9253851Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9258411Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9262983Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9267598Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9271942Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9276446Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9279137Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:21.9280080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9280810Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9281247Z unimplemented [] 2025-12-04T10:01:21.9281662Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9282520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9285255Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9287804Z graph_break [] 2025-12-04T10:01:21.9288318Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9288911Z Autotune Choices Stats: 2025-12-04T10:01:21.9291747Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9294893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9296106Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9297465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9300254Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9304478Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9308897Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9313121Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9317409Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9321657Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9324314Z 
SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:21.9325152Z Autotune Choices Stats: 2025-12-04T10:01:21.9328131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9331828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9333521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9335431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9339083Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9343553Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9347992Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9352438Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9357115Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9361634Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9366539Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9370909Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:21.9375240Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9379739Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9382496Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:21.9383554Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:21.9384307Z Traceback (most recent call last): 2025-12-04T10:01:21.9385145Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:21.9386130Z self.assertTrue( 2025-12-04T10:01:21.9386790Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:21.9387572Z raise self.failureException(msg) 2025-12-04T10:01:21.9388437Z AssertionError: False is not true : Log file /tmp/tmpa7c6r43z/flex_attention_configs.json was not created 2025-12-04T10:01:21.9389111Z 2025-12-04T10:01:21.9389376Z To execute this test, run the following from 
the base repo dir: 2025-12-04T10:01:21.9390281Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:21.9462367Z 2025-12-04T10:01:21.9462959Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:21.9463729Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9464388Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9464947Z unimplemented [] 2025-12-04T10:01:21.9465428Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9470080Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.9473172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9473922Z graph_break [] 2025-12-04T10:01:21.9474410Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9476720Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.9478892Z current_size = base.storage().size() 2025-12-04T10:01:21.9479374Z Autotune Choices Stats: 2025-12-04T10:01:21.9482284Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.9485419Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9489839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9491081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9493855Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9497941Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9501975Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9506189Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9510497Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9514553Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9517059Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.9517809Z Autotune Choices Stats: 2025-12-04T10:01:21.9520511Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.9523919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9525385Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9527062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9530222Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9534473Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9539066Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9543455Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9547827Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9552255Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9556780Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9561198Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9565546Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9570022Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9572954Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.9573905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9574538Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9575036Z unimplemented [] 2025-12-04T10:01:21.9575515Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9576246Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9579092Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9581560Z graph_break [] 2025-12-04T10:01:21.9581986Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9582722Z Autotune Choices Stats: 2025-12-04T10:01:21.9585510Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9588796Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9589894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9591088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9593976Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9598191Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9602481Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9606826Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9611015Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9615197Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9617944Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:21.9618766Z Autotune Choices Stats: 2025-12-04T10:01:21.9621647Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9625273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9626903Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9628848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9632170Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9636592Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9641134Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9645456Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9649836Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9654182Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9658519Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9661883Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9665242Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9667820Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9669640Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:21.9670208Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9670546Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9670925Z unimplemented [] 2025-12-04T10:01:21.9671227Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9671754Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9673318Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9674793Z graph_break [] 2025-12-04T10:01:21.9675183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9675575Z Autotune Choices Stats: 2025-12-04T10:01:21.9677229Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9679119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9679724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9680440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9682144Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9684582Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9687073Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9689764Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9692223Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9694689Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9696277Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:21.9696764Z Autotune Choices Stats: 2025-12-04T10:01:21.9698470Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9700533Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9701460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9702558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9704567Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9707311Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9709923Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9712424Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9715027Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9717526Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9720008Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9722565Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9725159Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9727761Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9729383Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:21.9729978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9730405Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9730722Z unimplemented [] 2025-12-04T10:01:21.9731027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9731495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9733087Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9734496Z graph_break [] 2025-12-04T10:01:21.9734870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9735230Z Autotune Choices Stats: 2025-12-04T10:01:21.9736852Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9738739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9739455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9740144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9741856Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9744341Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9746908Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9749425Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9751863Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:21.9754358Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9756226Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:21.9756710Z Autotune Choices Stats: 2025-12-04T10:01:21.9758520Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9760575Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9761518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9762642Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9764727Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9767399Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9769938Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9772444Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9775031Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9777524Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9780082Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:21.9782747Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9785312Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9787996Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9789550Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:21.9790124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9790565Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9790849Z unimplemented [] 
2025-12-04T10:01:21.9791129Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9791649Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9793280Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9794698Z graph_break [] 2025-12-04T10:01:21.9795074Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9795421Z Autotune Choices Stats: 2025-12-04T10:01:21.9797027Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9798905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9799558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9800287Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9802004Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9804530Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9807071Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9809492Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9811887Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9814374Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9815922Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:21.9816352Z Autotune Choices Stats: 2025-12-04T10:01:21.9818098Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9820160Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9821230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9822369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9824297Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9826954Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9829548Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9832045Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9834612Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9837113Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9839717Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9842245Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9844823Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:21.9847431Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9849085Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:21.9849634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9850102Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9850354Z unimplemented [] 2025-12-04T10:01:21.9850639Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9851185Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9852743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9854215Z graph_break [] 2025-12-04T10:01:21.9854524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9854891Z Autotune Choices Stats: 2025-12-04T10:01:21.9856822Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9858664Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9859343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9860190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9861853Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9864408Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9866878Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9869416Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9871808Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9874301Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9875865Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:21.9876373Z Autotune Choices Stats: 2025-12-04T10:01:21.9878106Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:21.9880260Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9881346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9882392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9884307Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9886904Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9889411Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9891987Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9894502Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9897000Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9899695Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9902282Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9904768Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9907409Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9909014Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:21.9909649Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:21.9910057Z Traceback (most recent call last): 2025-12-04T10:01:21.9910607Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:21.9911180Z self.assertTrue( 2025-12-04T10:01:21.9911607Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:21.9912053Z raise self.failureException(msg) 2025-12-04T10:01:21.9912568Z AssertionError: False is not true : Log file /tmp/tmp2rqaxu_5/flex_attention_configs.json was not created 2025-12-04T10:01:21.9912978Z 2025-12-04T10:01:21.9913144Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:21.9913695Z python 
test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:21.9914047Z 2025-12-04T10:01:21.9914291Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:21.9914783Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9915167Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9915516Z unimplemented [] 2025-12-04T10:01:21.9915780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9917328Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:21.9918962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9919495Z graph_break [] 2025-12-04T10:01:21.9919792Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9921160Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:21.9922397Z current_size = base.storage().size() 2025-12-04T10:01:21.9922713Z Autotune Choices Stats: 2025-12-04T10:01:21.9924413Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:21.9926245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9926826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9927616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9929260Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9931710Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9934178Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9936612Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9939145Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9941652Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9943175Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:21.9943747Z Autotune Choices Stats: 2025-12-04T10:01:21.9945411Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:21.9947577Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9948570Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9949600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9951608Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9954142Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9957027Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9959776Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9962303Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9964853Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9967368Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:21.9969867Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:21.9972404Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9974913Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:21.9976599Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:21.9977219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:21.9977647Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:21.9977912Z unimplemented [] 2025-12-04T10:01:21.9978296Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:21.9978721Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:21.9980260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:21.9981751Z graph_break [] 2025-12-04T10:01:21.9982066Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:21.9982470Z Autotune Choices Stats: 2025-12-04T10:01:21.9984128Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:21.9985932Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:21.9986628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:21.9987413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:21.9989045Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9991542Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:21.9993958Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:21.9996458Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:21.9999004Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0001439Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0003091Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.0003578Z Autotune Choices Stats: 2025-12-04T10:01:22.0005236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0007353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0008275Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0009328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0011290Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0013801Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0016374Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0019037Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0021566Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0024103Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0026631Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0029188Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0031797Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0034361Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0036088Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.0036625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0037013Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0037345Z unimplemented [] 2025-12-04T10:01:22.0037621Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0038134Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0039731Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0041173Z graph_break [] 2025-12-04T10:01:22.0041486Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0041852Z Autotune Choices Stats: 2025-12-04T10:01:22.0043498Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0045298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0045975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0046682Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0048291Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0050846Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0053331Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0056165Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0058625Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0061044Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0062649Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.0063092Z Autotune Choices Stats: 2025-12-04T10:01:22.0064775Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0066873Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0067870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0068922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0070963Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0073606Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0076256Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0078786Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0081277Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0083833Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0086354Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0088828Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0091477Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0094090Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0095729Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.0096296Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0096665Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0096966Z unimplemented [] 2025-12-04T10:01:22.0097281Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0097745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0099358Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0100764Z graph_break [] 2025-12-04T10:01:22.0101054Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0101489Z Autotune Choices Stats: 2025-12-04T10:01:22.0103163Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0104980Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0105671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0106346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0108047Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0110538Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0113043Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0115597Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0118049Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.0120466Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0122123Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.0122603Z Autotune Choices Stats: 2025-12-04T10:01:22.0124239Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0126370Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0127309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0128432Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0130453Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0133060Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0135786Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0138295Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0140836Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0143409Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0145912Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.0148502Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0151107Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0153662Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0155514Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.0156109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0156474Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0156816Z unimplemented [] 
2025-12-04T10:01:22.0157158Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0157592Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0159196Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0160610Z graph_break [] 2025-12-04T10:01:22.0160924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0161374Z Autotune Choices Stats: 2025-12-04T10:01:22.0162986Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0164846Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0165464Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0166168Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0167847Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0170449Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0172980Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0175468Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0177907Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0180302Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0181923Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.0182420Z Autotune Choices Stats: 2025-12-04T10:01:22.0184118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0186203Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0187131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0188417Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0190373Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0191679Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0192896Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0194146Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0195411Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0196612Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0197832Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0199116Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0200427Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0201669Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0201952Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.0202161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0202263Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0202339Z unimplemented [] 2025-12-04T10:01:22.0202575Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0202788Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0204095Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0204204Z graph_break [] 2025-12-04T10:01:22.0204360Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0204543Z Autotune Choices Stats: 2025-12-04T10:01:22.0206009Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0206320Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0206585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0206961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0208184Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0209524Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0210700Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0211889Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0213063Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0214203Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0214584Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.0214680Z Autotune Choices Stats: 2025-12-04T10:01:22.0216207Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0216673Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0217128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0217916Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0219177Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0220385Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0221624Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0222831Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0224105Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0225349Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0226609Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0228009Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0229211Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0230479Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0230766Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.0230972Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0231075Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0231197Z unimplemented [] 2025-12-04T10:01:22.0231324Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0231593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0232861Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0232968Z graph_break [] 2025-12-04T10:01:22.0233206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0233301Z Autotune Choices Stats: 2025-12-04T10:01:22.0234812Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0235108Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0235372Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0235820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0237054Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.0238228Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0239497Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0240671Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0241857Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0243017Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0243311Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.0243472Z Autotune Choices Stats: 2025-12-04T10:01:22.0244994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.0245526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0246018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.0246609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0247913Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0249183Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0250429Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0251675Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0252881Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0254156Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0255701Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0257077Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0258318Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0259524Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0259894Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.0260130Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 
2025-12-04T10:01:22.0260235Z Traceback (most recent call last):
2025-12-04T10:01:22.0260598Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:22.0260694Z self.assertTrue(
2025-12-04T10:01:22.0260942Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:22.0261101Z raise self.failureException(msg)
2025-12-04T10:01:22.0261398Z AssertionError: False is not true : Log file /tmp/tmpty_oqglf/flex_attention_configs.json was not created
2025-12-04T10:01:22.0261404Z
2025-12-04T10:01:22.0261604Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:22.0261887Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:22.0261895Z
2025-12-04T10:01:22.0262168Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:22.0262315Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:22.0262462Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:22.0262617Z unimplemented []
2025-12-04T10:01:22.0262751Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:22.0264084Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:22.0264399Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:22.0264474Z graph_break []
2025-12-04T10:01:22.0264765Z
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0265806Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.0265951Z current_size = base.storage().size() 2025-12-04T10:01:22.0266062Z Autotune Choices Stats: 2025-12-04T10:01:22.0267602Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.0267933Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0268199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0268589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0269761Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0270963Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0272187Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0273425Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0274679Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0275908Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0276215Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.0276299Z Autotune Choices Stats: 2025-12-04T10:01:22.0277870Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.0278358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0278755Z 
strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0279345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0280575Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0281823Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0283149Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0284436Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0285638Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0286866Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0288112Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0289357Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0290601Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0293199Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.0295014Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.0295853Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0296408Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0297001Z unimplemented [] 2025-12-04T10:01:22.0297653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0298187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0299798Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0301386Z graph_break [] 2025-12-04T10:01:22.0301697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0302161Z Autotune Choices Stats: 2025-12-04T10:01:22.0303797Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0305684Z AUTOTUNE 
flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0306408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0307165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0309049Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0311580Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0314032Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.0316517Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0319511Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0322491Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0324660Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.0325154Z Autotune Choices Stats: 2025-12-04T10:01:22.0326808Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0329128Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0330170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0331572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0333811Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0336974Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0339716Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0342400Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0344936Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0347605Z 
triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0350119Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0352644Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0355169Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0358092Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0360010Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.0360581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0360942Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0361318Z unimplemented [] 2025-12-04T10:01:22.0361628Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0362078Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0363717Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0365097Z graph_break [] 2025-12-04T10:01:22.0365396Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0365847Z Autotune Choices Stats: 
2025-12-04T10:01:22.0367484Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0369343Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0370004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0370674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0372357Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0374844Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0377343Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0379910Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0382314Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0384727Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0386344Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.0386832Z Autotune Choices Stats: 2025-12-04T10:01:22.0388632Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0390695Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0391689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0392805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0394699Z triton_flex_attention_backward_45 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0397313Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0399974Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0402493Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0405033Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0407549Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0410030Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0412586Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0415186Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0417745Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0419381Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.0419939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0420263Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0420620Z unimplemented [] 2025-12-04T10:01:22.0420930Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0421342Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0422953Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0424372Z graph_break [] 2025-12-04T10:01:22.0424713Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0425158Z Autotune Choices Stats: 2025-12-04T10:01:22.0426736Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0428690Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0429301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0429997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0431679Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0434187Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0436653Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0439144Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0441563Z triton_flex_attention_57 0.0113 
ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0444066Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0445614Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.0446082Z Autotune Choices Stats: 2025-12-04T10:01:22.0447794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0449839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0450748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0451855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0453841Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0456736Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0459341Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0461900Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0464448Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0467013Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0469545Z triton_flex_attention_backward_69 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0472136Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0474745Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0477403Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0478991Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.0479509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0479936Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0480267Z unimplemented [] 2025-12-04T10:01:22.0480554Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0481034Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0482598Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0484018Z graph_break [] 2025-12-04T10:01:22.0484371Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0484701Z Autotune Choices Stats: 2025-12-04T10:01:22.0486331Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0488195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0488828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0489527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0491251Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0493775Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0496309Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0498772Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0501190Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0503681Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0505249Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.0505705Z Autotune Choices Stats: 2025-12-04T10:01:22.0507467Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0509511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0510463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0511697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0513722Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0516289Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0518808Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0521316Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0523873Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0526360Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0528861Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0531497Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0534329Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0536883Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0538490Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.0539036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0539454Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0539749Z unimplemented [] 2025-12-04T10:01:22.0540004Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0540519Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0542075Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0543468Z graph_break [] 2025-12-04T10:01:22.0543875Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.0544241Z Autotune Choices Stats: 2025-12-04T10:01:22.0545848Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0547814Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0548452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0549205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0550883Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0553393Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0556369Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0558913Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0561375Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0563858Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0565417Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.0565847Z Autotune Choices Stats: 2025-12-04T10:01:22.0567580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0569666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0570747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0571940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0573866Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0576447Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0578977Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0581456Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0584022Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0586507Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0589162Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0591741Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0594379Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0596930Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0598495Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.0599049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0599535Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0599802Z unimplemented [] 2025-12-04T10:01:22.0600075Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0600608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:22.0602168Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0603609Z graph_break [] 2025-12-04T10:01:22.0603904Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0604289Z Autotune Choices Stats: 2025-12-04T10:01:22.0605951Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0607789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0608412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0609182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0610935Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0613386Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0615872Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0618284Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0620700Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0702613Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0704145Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.0704554Z Autotune Choices Stats: 2025-12-04T10:01:22.0706319Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.0708488Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0709341Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0710325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0712178Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0714616Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0717040Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0719454Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0721869Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0724359Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0726765Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0729228Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0731644Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0734045Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0735548Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.0736021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0736323Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0736523Z unimplemented [] 2025-12-04T10:01:22.0736735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0737107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0738594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0739917Z graph_break [] 2025-12-04T10:01:22.0740149Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0740436Z Autotune Choices Stats: 2025-12-04T10:01:22.0742046Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.0743855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0744401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0745016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0746569Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0748957Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0751286Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0753605Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0756081Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0758419Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0759870Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.0760271Z Autotune Choices Stats: 2025-12-04T10:01:22.0761974Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0764064Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0764912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0765893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0767746Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0770171Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0772572Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0774984Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0777393Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0779898Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0782364Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0784772Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0787198Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0789642Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0791138Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.0791646Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.0791987Z Traceback (most recent call last): 2025-12-04T10:01:22.0792434Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.0792874Z self.assertTrue( 2025-12-04T10:01:22.0793193Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.0793552Z raise self.failureException(msg) 2025-12-04T10:01:22.0793951Z AssertionError: False is not true : Log file /tmp/tmp46qub0vx/flex_attention_configs.json was not created 2025-12-04T10:01:22.0794274Z 2025-12-04T10:01:22.0794413Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.0794881Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.0795211Z 2025-12-04T10:01:22.0795382Z This message can be suppressed by setting 
PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.0795771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0796062Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0796260Z unimplemented [] 2025-12-04T10:01:22.0796541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0797926Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.0799450Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0799783Z graph_break [] 2025-12-04T10:01:22.0800012Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0801235Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.0802387Z current_size = base.storage().size() 2025-12-04T10:01:22.0802617Z Autotune Choices Stats: 2025-12-04T10:01:22.0804171Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.0805922Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0806476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0807088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0808631Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0810965Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0813298Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0815701Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0818087Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0820425Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0821886Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.0822281Z Autotune Choices Stats: 2025-12-04T10:01:22.0823850Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.0825814Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0826673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0827701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0829540Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0832102Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0834596Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0837005Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0839419Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0841834Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0844250Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0846651Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0849059Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0851530Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0853073Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.0853526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0853811Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0854003Z unimplemented [] 2025-12-04T10:01:22.0854211Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0854574Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0856589Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0857919Z graph_break [] 2025-12-04T10:01:22.0858146Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0858419Z Autotune Choices Stats: 2025-12-04T10:01:22.0859957Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0861676Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0862218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0862833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0864373Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0866714Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0869261Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0871692Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0874032Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0876359Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0877810Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.0878201Z Autotune Choices Stats: 2025-12-04T10:01:22.0879769Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0881722Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0882580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0883556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0885379Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0887878Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0890354Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0892758Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0895166Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0897593Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0900004Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0902413Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0904902Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0907401Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0908893Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.0909360Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0909651Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0909846Z unimplemented [] 2025-12-04T10:01:22.0910054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0910438Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0911906Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0913226Z graph_break [] 2025-12-04T10:01:22.0913464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0913751Z Autotune Choices Stats: 2025-12-04T10:01:22.0915276Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0917002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0917542Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0918203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0919742Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0922156Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0924551Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0926881Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0929199Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0931533Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0932980Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.0933368Z Autotune Choices Stats: 2025-12-04T10:01:22.0934951Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0936918Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0937760Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0938730Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0940635Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0943135Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0945556Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0948011Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0950432Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0952851Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0955361Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.0957894Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.0960408Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.0962822Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0964310Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.0964779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.0965062Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.0965271Z unimplemented [] 2025-12-04T10:01:22.0965474Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.0965849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.0967321Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.0968649Z graph_break [] 2025-12-04T10:01:22.0968882Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.0969171Z Autotune Choices Stats: 2025-12-04T10:01:22.0970699Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.0972417Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0972970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0973579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0975200Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0977534Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0979935Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.0982267Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.0984602Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.0986933Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.0988433Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.0988828Z Autotune Choices Stats: 2025-12-04T10:01:22.0990407Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.0992371Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.0993217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.0994253Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.0996146Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.0998592Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1001016Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1003440Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1005857Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1008257Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1010656Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.1013132Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1015606Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1018013Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1019513Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.1019976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1020254Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1020449Z unimplemented [] 
2025-12-04T10:01:22.1020656Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1021019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1022490Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1023819Z graph_break [] 2025-12-04T10:01:22.1024046Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1024322Z Autotune Choices Stats: 2025-12-04T10:01:22.1025855Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1027627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1028180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1028799Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1030433Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1032834Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1035185Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1037525Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1039860Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1042206Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1043666Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.1044055Z Autotune Choices Stats: 2025-12-04T10:01:22.1045623Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1047585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1048498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1049533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1051376Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1053800Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1056338Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1058751Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1061161Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1063585Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1066099Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1068642Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1071051Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1073468Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1074962Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.1075427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1075715Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1075905Z unimplemented [] 2025-12-04T10:01:22.1076110Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1076481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1077941Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1079261Z graph_break [] 2025-12-04T10:01:22.1079491Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1079778Z Autotune Choices Stats: 2025-12-04T10:01:22.1081319Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1083040Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1083654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1084335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1085877Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1088222Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1090553Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1092882Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1095218Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1097546Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1099005Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.1099404Z Autotune Choices Stats: 2025-12-04T10:01:22.1101072Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1103163Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1104009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1104989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1106824Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1109288Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1111711Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1114137Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1116550Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1119039Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1121522Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1123935Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1126356Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1128792Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1130287Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.1130759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1131053Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1131249Z unimplemented [] 2025-12-04T10:01:22.1131449Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1131819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1133283Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1134604Z graph_break [] 2025-12-04T10:01:22.1134829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1135112Z Autotune Choices Stats: 2025-12-04T10:01:22.1136714Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1138510Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1139057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1139669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1141214Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.1143545Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1145886Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1148276Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1150620Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1152969Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1153225Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.1153294Z Autotune Choices Stats: 2025-12-04T10:01:22.1154825Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.1155471Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1155819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.1156385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1157576Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1158748Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1159936Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1161120Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1162292Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1163572Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1164829Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1166018Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1167200Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1168379Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1168636Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.1168769Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.1168841Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1168912Z unimplemented [] 2025-12-04T10:01:22.1169020Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1169217Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1170420Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1170484Z graph_break [] 2025-12-04T10:01:22.1170620Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1170688Z Autotune Choices Stats: 2025-12-04T10:01:22.1172198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.1172510Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1172738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1173058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1174215Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1175358Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1176501Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1177645Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1178790Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1179996Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1180315Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.1180383Z Autotune Choices Stats: 2025-12-04T10:01:22.1181844Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1182289Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1182633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1183192Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1184377Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1185556Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1186731Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1187944Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1189185Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1190436Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1191603Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1192781Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1193957Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1195137Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1195391Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.1195527Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1195598Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1195666Z unimplemented [] 2025-12-04T10:01:22.1195779Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1195969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1197237Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1197361Z graph_break [] 2025-12-04T10:01:22.1197496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1197565Z Autotune Choices Stats: 2025-12-04T10:01:22.1198990Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.1199240Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1199466Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1199784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1200926Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1202062Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1203201Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1204330Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1205484Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1206682Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1206993Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.1207060Z Autotune Choices Stats: 2025-12-04T10:01:22.1208531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1208973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1209310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1209875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1211063Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1212237Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1213412Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1214650Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1215897Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1217068Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1218244Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1219421Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1220605Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1221776Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1222034Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.1222209Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.1222289Z Traceback (most recent call last): 2025-12-04T10:01:22.1222599Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.1222666Z self.assertTrue( 2025-12-04T10:01:22.1222863Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.1223035Z raise self.failureException(msg) 2025-12-04T10:01:22.1223339Z AssertionError: False is not true : Log file /tmp/tmp7_xq874k/flex_attention_configs.json was not created 2025-12-04T10:01:22.1223343Z 
2025-12-04T10:01:22.1223487Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.1223748Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.1223752Z 2025-12-04T10:01:22.1223916Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.1224053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1224128Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1224199Z unimplemented [] 2025-12-04T10:01:22.1224305Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1225518Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.1225716Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1225778Z graph_break [] 2025-12-04T10:01:22.1225915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1226926Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.1227017Z current_size = base.storage().size() 2025-12-04T10:01:22.1227093Z Autotune Choices Stats: 2025-12-04T10:01:22.1228570Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.1228829Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1229054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1229385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1230531Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1231738Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1232931Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1234060Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1235205Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1236331Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1236586Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.1236654Z Autotune Choices Stats: 2025-12-04T10:01:22.1238107Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.1238557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1238891Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1239457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1240708Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1241948Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1243126Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1244296Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1245474Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1246644Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1247818Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1249060Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1250293Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1251469Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1251718Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.1251855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1251924Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1251986Z unimplemented [] 2025-12-04T10:01:22.1252096Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1252281Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1253485Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1253549Z graph_break [] 2025-12-04T10:01:22.1253678Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1253751Z Autotune Choices Stats: 2025-12-04T10:01:22.1255157Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1255839Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1256064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1256389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1257668Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1258895Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1260032Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1261174Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1262308Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1263443Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1263696Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.1263764Z Autotune Choices Stats: 2025-12-04T10:01:22.1265236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1265679Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1266131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1266705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1268003Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1269184Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1270375Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1271541Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1272710Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1273869Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1275116Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1276306Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1277542Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1278709Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1278957Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.1279095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1279166Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1279230Z unimplemented [] 2025-12-04T10:01:22.1279342Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1279531Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1280737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1280800Z graph_break [] 2025-12-04T10:01:22.1280928Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1281003Z Autotune Choices Stats: 2025-12-04T10:01:22.1282416Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1282671Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1282894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1283215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1284428Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1285647Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1286778Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1287931Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1289057Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1290197Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1290451Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.1290528Z Autotune Choices Stats: 2025-12-04T10:01:22.1291998Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1292510Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1292912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1293483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1294667Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1295851Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1297028Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1298205Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1299384Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1300551Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1301789Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1303025Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1304194Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1305378Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1305628Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.1305766Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1305836Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1305909Z unimplemented [] 2025-12-04T10:01:22.1306018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1306205Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1307457Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1307522Z graph_break [] 2025-12-04T10:01:22.1307655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1307728Z Autotune Choices Stats: 2025-12-04T10:01:22.1309136Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1309456Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1309692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1310073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1311215Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1312346Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1313476Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1314619Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1315748Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.1316897Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1317154Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.1317220Z Autotune Choices Stats: 2025-12-04T10:01:22.1318751Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1319280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1319612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1320181Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1321369Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1322551Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1323727Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1324901Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1326078Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1327321Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1328552Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.1329745Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1330910Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1332089Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1332339Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.1332475Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1332544Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1332608Z unimplemented [] 
2025-12-04T10:01:22.1332717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1332904Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1334118Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1334183Z graph_break [] 2025-12-04T10:01:22.1334315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1334389Z Autotune Choices Stats: 2025-12-04T10:01:22.1335872Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1336187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1336410Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1336731Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1337878Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1339015Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1340141Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1341275Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1342405Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1343541Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1343786Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.1343862Z Autotune Choices Stats: 2025-12-04T10:01:22.1345384Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1346091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1346431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1347004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1348244Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1349428Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1350612Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1351782Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1353044Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1354220Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1355610Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1356783Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1357959Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1359135Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1359388Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.1359531Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1359605Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1359671Z unimplemented [] 2025-12-04T10:01:22.1359786Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1359980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1361186Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1361255Z graph_break [] 2025-12-04T10:01:22.1361389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1361464Z Autotune Choices Stats: 2025-12-04T10:01:22.1362994Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1363336Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1363559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1363884Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1365026Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1366165Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1367309Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1368448Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1369578Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1370781Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1371090Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.1371162Z Autotune Choices Stats: 2025-12-04T10:01:22.1372617Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1373061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1373396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1373962Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1375156Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1376349Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1377524Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1378700Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1379940Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1381178Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1382362Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1383541Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1384733Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1385918Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1386169Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.1386309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1386381Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1386445Z unimplemented [] 2025-12-04T10:01:22.1386554Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1386742Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1388080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1388215Z graph_break [] 2025-12-04T10:01:22.1388343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1388417Z Autotune Choices Stats: 2025-12-04T10:01:22.1389833Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1390086Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1390309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1390637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1392248Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.1393488Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1394880Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1396172Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1397412Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1398553Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1398873Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.1398951Z Autotune Choices Stats: 2025-12-04T10:01:22.1400419Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.1400877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1401215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.1401788Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1402988Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1404174Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1405349Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1406650Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1407893Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1409062Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1410238Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1411407Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1412589Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1413778Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1414035Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.1414174Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.1414254Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1414319Z unimplemented [] 2025-12-04T10:01:22.1414437Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1414697Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1415903Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1416074Z graph_break [] 2025-12-04T10:01:22.1416209Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1416285Z Autotune Choices Stats: 2025-12-04T10:01:22.1417710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.1417971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1418196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1418512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1419677Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1420818Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1421956Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1423097Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1424303Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1425505Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1425756Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.1425834Z Autotune Choices Stats: 2025-12-04T10:01:22.1427397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1427852Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1428190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1428765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1429956Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1431141Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1432385Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1433623Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1434802Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1435975Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1437150Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1438318Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1439500Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1440674Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1440927Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.1441126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1441262Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1441324Z unimplemented [] 2025-12-04T10:01:22.1441436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1441622Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1442830Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1442897Z graph_break [] 2025-12-04T10:01:22.1443030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1443104Z Autotune Choices Stats: 2025-12-04T10:01:22.1444530Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.1444784Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1445020Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1445342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1446501Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1447639Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1448778Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1449985Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1451193Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1452332Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1452582Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.1452657Z Autotune Choices Stats: 2025-12-04T10:01:22.1454110Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1454558Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1454892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1455670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1456867Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1458049Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1459338Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1460604Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1461781Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1462954Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1464128Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1465297Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1466477Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1467774Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1468089Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.1468223Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1468300Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1468364Z unimplemented [] 2025-12-04T10:01:22.1468475Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1468665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1469865Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1469937Z graph_break [] 2025-12-04T10:01:22.1470066Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1470139Z Autotune Choices Stats: 2025-12-04T10:01:22.1471559Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1471817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1472041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1472357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1473515Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1474645Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1475860Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1477057Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1478185Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1479324Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1479570Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.1479656Z Autotune Choices Stats: 2025-12-04T10:01:22.1481110Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1481554Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1481889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1482457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1483641Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1484915Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1486154Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1487332Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1488510Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1489677Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1490855Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1492028Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1493272Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1494506Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1494755Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.1494932Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:22.1495017Z Traceback (most recent call last):
2025-12-04T10:01:22.1495329Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:22.1495406Z self.assertTrue(
2025-12-04T10:01:22.1495608Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:22.1495689Z raise self.failureException(msg)
2025-12-04T10:01:22.1495941Z AssertionError: False is not true : Log file /tmp/tmp93xtstbc/flex_attention_configs.json was not created
2025-12-04T10:01:22.1495946Z
2025-12-04T10:01:22.1496081Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:22.1496337Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:22.1496346Z
2025-12-04T10:01:22.1496512Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:22.1496654Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:22.1496772Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:22.1496882Z unimplemented []
2025-12-04T10:01:22.1497039Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:22.1498870Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:22.1499160Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved',
2), ('ok', 2)] 2025-12-04T10:01:22.1499273Z graph_break [] 2025-12-04T10:01:22.1499438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1500459Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.1500549Z current_size = base.storage().size() 2025-12-04T10:01:22.1500618Z Autotune Choices Stats: 2025-12-04T10:01:22.1502146Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.1502398Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1502692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1503013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1504170Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1505307Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1506445Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1507655Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1508789Z triton_flex_attention_3 0.0154 ms 
61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1509921Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1510172Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.1510241Z Autotune Choices Stats: 2025-12-04T10:01:22.1511778Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.1512286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1512627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1513191Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1514387Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1515571Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1516746Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1517918Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1519090Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1520347Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1521578Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1522756Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1523928Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1525094Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1525357Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.1525496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1525571Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1525640Z unimplemented [] 2025-12-04T10:01:22.1525747Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1525950Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1527154Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1527225Z graph_break [] 2025-12-04T10:01:22.1527355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1527422Z Autotune Choices Stats: 2025-12-04T10:01:22.1528930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:22.1529245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1529475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1529792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1530944Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1532072Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1533203Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1534335Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1535468Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1536598Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1536919Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.1536990Z Autotune Choices Stats: 2025-12-04T10:01:22.1538564Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1539011Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1539364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1539934Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1541126Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1542301Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1543476Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1544651Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1545900Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.1547159Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1548388Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1549561Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1550731Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1551902Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1552163Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.1552298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1552370Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1552439Z unimplemented [] 2025-12-04T10:01:22.1552555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1552754Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1553955Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1554016Z graph_break [] 2025-12-04T10:01:22.1554223Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.1554292Z Autotune Choices Stats: 2025-12-04T10:01:22.1555945Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1556198Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1556429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1556741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1557892Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1559018Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1560156Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1561299Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1562424Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1563664Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1564005Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.1564074Z Autotune Choices Stats: 2025-12-04T10:01:22.1565525Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1565965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1566299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1566864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1568058Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1569236Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1570415Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1571589Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1572822Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1574061Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1575228Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1576401Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1577579Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1578744Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1578998Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.1579129Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1579198Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1579265Z unimplemented [] 2025-12-04T10:01:22.1579369Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1579560Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:22.1580827Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1580971Z graph_break [] 2025-12-04T10:01:22.1581107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1581174Z Autotune Choices Stats: 2025-12-04T10:01:22.1582594Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1582842Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1583076Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1583389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1584532Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1585661Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1586793Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1587972Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1589166Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1590378Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1590628Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.1590696Z Autotune Choices Stats: 2025-12-04T10:01:22.1592163Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1592605Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1592943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1593505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1594698Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1595876Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1597060Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1598303Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1599531Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1600713Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1601881Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1603054Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1604238Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1605406Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1605662Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.1605793Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1605863Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1605940Z unimplemented [] 2025-12-04T10:01:22.1606052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1606309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1607576Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1607638Z graph_break [] 2025-12-04T10:01:22.1607775Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1607843Z Autotune Choices Stats: 2025-12-04T10:01:22.1609259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1609507Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1609731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1610045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1611193Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1612340Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1613479Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1614610Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1616032Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1617394Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1617643Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.1617724Z Autotune Choices Stats: 2025-12-04T10:01:22.1619178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1619616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1619950Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1620515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1621706Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1622881Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1624134Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1625511Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1626851Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1628081Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1629256Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1630430Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1631596Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1632768Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1633091Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.1633226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1633361Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1633430Z unimplemented [] 2025-12-04T10:01:22.1633533Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1633719Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1634923Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1634988Z graph_break [] 2025-12-04T10:01:22.1635123Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.1635195Z Autotune Choices Stats: 2025-12-04T10:01:22.1636604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1636851Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1637081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1637411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1638555Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1639694Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1640838Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1642035Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1643227Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1644361Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1644614Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.1644683Z Autotune Choices Stats: 2025-12-04T10:01:22.1646149Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1646591Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1646927Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1647489Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1648686Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1649949Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1651128Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1652369Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1653544Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1654715Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1656390Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1657572Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1658740Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1660023Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1660373Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.1660509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1660582Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1660654Z unimplemented [] 2025-12-04T10:01:22.1660759Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1660949Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:22.1662153Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1662219Z graph_break [] 2025-12-04T10:01:22.1662355Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1662427Z Autotune Choices Stats: 2025-12-04T10:01:22.1663847Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1664097Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1664315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1664639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1665798Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1666937Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1668190Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1669390Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1670533Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1671669Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1671923Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.1671992Z Autotune Choices Stats: 2025-12-04T10:01:22.1673454Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.1673900Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1674239Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1674804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1675996Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1677233Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1678551Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1679720Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1680909Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1682079Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1683253Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1684427Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1685662Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1686898Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1687150Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.1687280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1687349Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1687419Z unimplemented [] 2025-12-04T10:01:22.1687523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1687712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1688911Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1688973Z graph_break [] 2025-12-04T10:01:22.1689116Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1689186Z Autotune Choices Stats: 2025-12-04T10:01:22.1690608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.1690858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1691077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1691401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1692541Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1693748Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1694948Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1696082Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1697217Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1698355Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1698606Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.1698673Z Autotune Choices Stats: 2025-12-04T10:01:22.1700129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1700570Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1700911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1701474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1702728Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1703965Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1705147Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1706317Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1707540Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1708718Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1709893Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1711067Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1712303Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1713554Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1713802Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.1713937Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1714006Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1714075Z unimplemented [] 2025-12-04T10:01:22.1714178Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1714364Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1715577Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1715641Z graph_break [] 2025-12-04T10:01:22.1715774Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1715846Z Autotune Choices Stats: 2025-12-04T10:01:22.1717269Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.1717519Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1717745Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1718071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1719215Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1720418Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1721613Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1722747Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1723889Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1725019Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1725282Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.1725350Z Autotune Choices Stats: 2025-12-04T10:01:22.1732125Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1732617Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1732966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1733642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1734927Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1736103Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1737274Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1738449Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1739618Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1740796Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1741968Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1743205Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1744429Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1745595Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1745858Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.1746009Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1746083Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1746156Z unimplemented [] 2025-12-04T10:01:22.1746263Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1746455Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1747758Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1747827Z graph_break [] 2025-12-04T10:01:22.1747967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1748038Z Autotune Choices Stats: 2025-12-04T10:01:22.1749462Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1749720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1749944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1750266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1751521Z 
triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1752720Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1753851Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1755002Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1756341Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1757480Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1757740Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.1757807Z Autotune Choices Stats: 2025-12-04T10:01:22.1759276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:22.1759718Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1760173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1760830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1762015Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1763189Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1764364Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1765536Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1766703Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1767869Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1769100Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1770331Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1771507Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1772683Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1772936Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.1773079Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1773154Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1773220Z unimplemented [] 2025-12-04T10:01:22.1773330Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1773520Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1774726Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1774788Z graph_break [] 2025-12-04T10:01:22.1774929Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1774997Z Autotune Choices Stats: 2025-12-04T10:01:22.1776433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.1776686Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1776977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1777301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1778515Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1779647Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1780776Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1781912Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1783057Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1784191Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1784445Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.1784512Z Autotune Choices Stats: 
2025-12-04T10:01:22.1786050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1786555Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1786893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1787497Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1788684Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.1789860Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1791040Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1792215Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1793404Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1794643Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1795872Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1797045Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1798209Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1799379Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1799628Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.1799810Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.1799890Z Traceback (most recent call last): 2025-12-04T10:01:22.1800198Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.1800274Z self.assertTrue( 2025-12-04T10:01:22.1800481Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.1800572Z raise self.failureException(msg) 2025-12-04T10:01:22.1800817Z AssertionError: False is not true : Log file /tmp/tmp62tcpu7h/flex_attention_configs.json was not created 2025-12-04T10:01:22.1800824Z 2025-12-04T10:01:22.1800961Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.1801234Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 
2025-12-04T10:01:22.1801238Z 2025-12-04T10:01:22.1801404Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.1801546Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1801618Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1801683Z unimplemented [] 2025-12-04T10:01:22.1801798Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1803077Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.1803621Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1803685Z graph_break [] 2025-12-04T10:01:22.1803820Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1804839Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.1804923Z current_size = base.storage().size() 2025-12-04T10:01:22.1804998Z Autotune Choices Stats: 2025-12-04T10:01:22.1806428Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.1806691Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1806917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1807239Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1808395Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1809533Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1810673Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1811868Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1813073Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1814201Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1814453Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.1814527Z Autotune Choices Stats: 2025-12-04T10:01:22.1815973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.1816422Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1816755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1817441Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1818636Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1819812Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1821091Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1822342Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1823515Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1824691Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1825970Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1827138Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1828370Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1829614Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1829930Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.1830066Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1830144Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1830209Z unimplemented [] 2025-12-04T10:01:22.1830321Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1830510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1831710Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1831782Z graph_break [] 2025-12-04T10:01:22.1831912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1831983Z Autotune Choices Stats: 2025-12-04T10:01:22.1833408Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1833664Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1833889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1834210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1835377Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1836512Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1837643Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1838846Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1840035Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1841176Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1841426Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.1841501Z Autotune Choices Stats: 2025-12-04T10:01:22.1842961Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1843411Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1843746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1844318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1845504Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1846751Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1847995Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1849187Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1850366Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1851535Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1852716Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1853884Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1855053Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1856474Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1856816Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.1856951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1857028Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1857091Z unimplemented [] 2025-12-04T10:01:22.1857196Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1857396Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1858590Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1858661Z graph_break [] 2025-12-04T10:01:22.1858790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1858863Z Autotune Choices Stats: 2025-12-04T10:01:22.1860275Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1860529Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1860753Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1861068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1862226Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1863363Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1864589Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1865792Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1866925Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1868109Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1868365Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.1868444Z Autotune Choices Stats: 2025-12-04T10:01:22.1869905Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1870351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1870686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1871253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1872439Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1873688Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1874930Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1876120Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1877300Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1878465Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1879643Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1880816Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1882070Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1883307Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1883559Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.1883696Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1883772Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1883838Z unimplemented [] 2025-12-04T10:01:22.1883942Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1884144Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1885351Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1885420Z graph_break [] 2025-12-04T10:01:22.1885549Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1885624Z Autotune Choices Stats: 2025-12-04T10:01:22.1887028Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1887281Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1887501Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1887820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1888973Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1890168Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1891369Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1892506Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1893630Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.1894763Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1895010Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.1895083Z Autotune Choices Stats: 2025-12-04T10:01:22.1896541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1896985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1897316Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1897882Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1899120Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1900360Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1901527Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1902700Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1903865Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1905048Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1906216Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.1907517Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1908764Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1910021Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1910274Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.1910404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1910482Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1910546Z unimplemented [] 
2025-12-04T10:01:22.1910648Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1910840Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1912044Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1912115Z graph_break [] 2025-12-04T10:01:22.1912245Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1912313Z Autotune Choices Stats: 2025-12-04T10:01:22.1913730Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1913976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1914203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1914517Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1915751Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1916884Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1918087Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1919224Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1920347Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1921490Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1921735Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.1921807Z Autotune Choices Stats: 2025-12-04T10:01:22.1923273Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1923717Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1924049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1924685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1925940Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1927132Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1928319Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1929503Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1930672Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1931850Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1933019Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1934250Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1935488Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1936656Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1936909Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.1937039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1937113Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1937176Z unimplemented [] 2025-12-04T10:01:22.1937276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1937468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1938660Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1938726Z graph_break [] 2025-12-04T10:01:22.1938853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1938920Z Autotune Choices Stats: 2025-12-04T10:01:22.1940333Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1940580Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1940809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1941124Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1942331Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1943551Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1944695Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1945847Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1946977Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1948160Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1948406Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.1948481Z Autotune Choices Stats: 2025-12-04T10:01:22.1949926Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.1950436Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1950770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1951404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1952588Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1953778Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1954956Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1956288Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1957460Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1958634Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1959951Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1961216Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1962405Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1963582Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1963837Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.1963967Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.1964044Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1964107Z unimplemented [] 2025-12-04T10:01:22.1964208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1964399Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1965599Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1965667Z graph_break [] 2025-12-04T10:01:22.1965796Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1965868Z Autotune Choices Stats: 2025-12-04T10:01:22.1967291Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.1967538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1967837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1968223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1969370Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.1970498Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1971634Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1972772Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.1973912Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.1975051Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1975300Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.1975374Z Autotune Choices Stats: 2025-12-04T10:01:22.1976894Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.1977467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1977802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.1978367Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1979562Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1980755Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1981936Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1983123Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1984294Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1985550Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.1986789Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1988010Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.1989191Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.1990363Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.1990619Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.1990751Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.1990825Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.1990888Z unimplemented [] 2025-12-04T10:01:22.1990991Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.1991186Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.1992381Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.1992451Z graph_break [] 2025-12-04T10:01:22.1992579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.1992645Z Autotune Choices Stats: 2025-12-04T10:01:22.1994143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.1994454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.1994683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.1994998Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.1996151Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.1997287Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2002949Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2004116Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2005296Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2006427Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2006684Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.2006761Z Autotune Choices Stats: 2025-12-04T10:01:22.2008353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2008844Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2009188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2009759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2010956Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2012219Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2013386Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2014556Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2015735Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2016993Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2018199Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2019357Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2020530Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2021748Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2022003Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.2022144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2022221Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2022285Z unimplemented [] 2025-12-04T10:01:22.2022398Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2022589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2023786Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2023852Z graph_break [] 2025-12-04T10:01:22.2023983Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2024056Z Autotune Choices Stats: 2025-12-04T10:01:22.2025534Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.2025821Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2026044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2026362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2027597Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2028727Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2029895Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2031032Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2032155Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2033353Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2033639Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2033712Z Autotune Choices Stats: 2025-12-04T10:01:22.2035164Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2035616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2035948Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2036513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2037731Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2038918Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2040082Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2041246Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2042476Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2043672Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2044848Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2046016Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2047240Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2048411Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2048659Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.2048798Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2048874Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2048937Z unimplemented [] 2025-12-04T10:01:22.2049046Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2049235Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2050499Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2050604Z graph_break [] 2025-12-04T10:01:22.2050734Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2050805Z Autotune Choices Stats: 2025-12-04T10:01:22.2052210Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2052469Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2052690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2053003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2054145Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2055525Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2056670Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2057803Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2058927Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2060170Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2060465Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.2060540Z Autotune Choices Stats: 2025-12-04T10:01:22.2061984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2062441Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2062782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2063411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2064588Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2065776Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2066938Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2068235Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2069448Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2070611Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2071775Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2072976Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2074136Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2075306Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2075554Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.2075683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2075756Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2075821Z unimplemented [] 2025-12-04T10:01:22.2075922Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2076112Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2077381Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2077497Z graph_break [] 2025-12-04T10:01:22.2077626Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2077695Z Autotune Choices Stats: 2025-12-04T10:01:22.2079106Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.2079359Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2079577Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2079888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2081072Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2082188Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2083326Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2084455Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2085651Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2086810Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2087057Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.2087130Z Autotune Choices Stats: 2025-12-04T10:01:22.2088580Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2089062Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2089393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2089959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2091140Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2092316Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2093572Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2094825Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2095993Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2097161Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2098332Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2099527Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2100704Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.2101870Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2102119Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.2102316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2102394Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2102492Z unimplemented [] 2025-12-04T10:01:22.2102594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2102784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2103981Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2104050Z graph_break [] 2025-12-04T10:01:22.2104178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2104254Z Autotune Choices Stats: 2025-12-04T10:01:22.2105667Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2105959Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2106180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2106492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2107679Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2108803Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2109937Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2111149Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2112312Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2113442Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2113689Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2113763Z Autotune Choices Stats: 2025-12-04T10:01:22.2115214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2115698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2116030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2116595Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2117774Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2118941Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2120167Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2121370Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.2122541Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2123710Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2124908Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2126085Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2127257Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2128485Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2128779Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.2128955Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.2129044Z Traceback (most recent call last): 2025-12-04T10:01:22.2129349Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.2129426Z self.assertTrue( 2025-12-04T10:01:22.2129639Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.2129722Z raise self.failureException(msg) 2025-12-04T10:01:22.2129973Z 
AssertionError: False is not true : Log file /tmp/tmpe81b4lib/flex_attention_configs.json was not created 2025-12-04T10:01:22.2129981Z 2025-12-04T10:01:22.2130116Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.2130375Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.2130380Z 2025-12-04T10:01:22.2130550Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.2130684Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2130759Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2130822Z unimplemented [] 2025-12-04T10:01:22.2130928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2132146Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.2132385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2132449Z graph_break [] 2025-12-04T10:01:22.2132579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2133581Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.2133668Z current_size = base.storage().size() 2025-12-04T10:01:22.2133736Z Autotune Choices Stats: 2025-12-04T10:01:22.2135166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.2135414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2135646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2136041Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2137225Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2138346Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2139484Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2140598Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2141766Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2142888Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2143146Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.2143215Z Autotune Choices Stats: 2025-12-04T10:01:22.2144664Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.2145199Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2145574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2146133Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2147409Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2148579Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2149793Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2150958Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2152122Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2153283Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2154508Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2155881Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2157051Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2158214Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2158536Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.2158675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2158746Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2158815Z unimplemented [] 2025-12-04T10:01:22.2158917Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2159107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2160310Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2160375Z graph_break [] 2025-12-04T10:01:22.2160510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2160588Z Autotune Choices Stats: 2025-12-04T10:01:22.2162005Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2162345Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2162627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2162947Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2164101Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2165242Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2166375Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2167544Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2168673Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2169800Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2170053Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.2170121Z Autotune Choices Stats: 2025-12-04T10:01:22.2171634Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2172117Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2172464Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2173036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2174252Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2175442Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2176679Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2177856Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2179030Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2180567Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2181792Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2182972Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2184142Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2185357Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2185633Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.2185778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2185849Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2185925Z unimplemented [] 2025-12-04T10:01:22.2186032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2186220Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2187517Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2187581Z graph_break [] 2025-12-04T10:01:22.2187719Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2187787Z Autotune Choices Stats: 2025-12-04T10:01:22.2189286Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2189570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2189798Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2190112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2191260Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2192393Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2193580Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2194707Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2195834Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2196959Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2197212Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.2197347Z Autotune Choices Stats: 2025-12-04T10:01:22.2198800Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2199279Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2199625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2200188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2201372Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2202577Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2203765Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2204951Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2206191Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2207411Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2208573Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2209742Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2210944Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2212111Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2212365Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.2212503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2212573Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2212641Z unimplemented [] 2025-12-04T10:01:22.2212744Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2212929Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2214125Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2214188Z graph_break [] 2025-12-04T10:01:22.2214323Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2214457Z Autotune Choices Stats: 2025-12-04T10:01:22.2215878Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2216160Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2216390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2216706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2217851Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2218984Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2220145Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2221275Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2222407Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.2223598Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2223886Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.2223953Z Autotune Choices Stats: 2025-12-04T10:01:22.2225417Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2225858Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2226196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2226759Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2228029Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2229207Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2230389Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2231564Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2232850Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2234049Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2235236Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.2236403Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2237604Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2238778Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2239029Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.2239161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2239238Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2239305Z unimplemented [] 
2025-12-04T10:01:22.2239413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2239599Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2240879Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2240983Z graph_break [] 2025-12-04T10:01:22.2241114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2241188Z Autotune Choices Stats: 2025-12-04T10:01:22.2242606Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2242861Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2243085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2243398Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2244550Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2245719Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2246865Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2248007Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2249197Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2250359Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2250604Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.2250681Z Autotune Choices Stats: 2025-12-04T10:01:22.2252129Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2252587Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2252920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2253527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2254713Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2256081Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2257259Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2258542Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2259970Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2261136Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2262311Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2263528Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2264695Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2265880Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2266133Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.2266263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2266338Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2266401Z unimplemented [] 2025-12-04T10:01:22.2266501Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2266759Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2268001Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2268107Z graph_break [] 2025-12-04T10:01:22.2268236Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2268324Z Autotune Choices Stats: 2025-12-04T10:01:22.2269741Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2269992Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2270213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2270565Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2271715Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2272842Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2273982Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2275130Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2276340Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2277504Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2277753Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.2277827Z Autotune Choices Stats: 2025-12-04T10:01:22.2279282Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2279763Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2280100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2280669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2281852Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2283033Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2284276Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2285487Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2286660Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2287835Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2289066Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2290237Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2291407Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2292574Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2292827Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.2293025Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2293134Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2293197Z unimplemented [] 2025-12-04T10:01:22.2293304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2293494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2294692Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2294767Z graph_break [] 2025-12-04T10:01:22.2294901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2294975Z Autotune Choices Stats: 2025-12-04T10:01:22.2296394Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2296686Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2296912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2297227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2298393Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.2299519Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2300655Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2301859Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2303020Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2304163Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2304412Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.2304487Z Autotune Choices Stats: 2025-12-04T10:01:22.2305950Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.2306451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2306783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.2307388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2308577Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2309760Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2311004Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2312215Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2313379Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2314558Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2315789Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2316964Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2318135Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2319366Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2319653Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.2319783Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.2319857Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2319922Z unimplemented [] 2025-12-04T10:01:22.2320022Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2320214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2321411Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2321478Z graph_break [] 2025-12-04T10:01:22.2321606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2321674Z Autotune Choices Stats: 2025-12-04T10:01:22.2323095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.2323380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2323612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2323928Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2325070Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2326204Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2327399Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2328572Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2329709Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2330854Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2331136Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.2331207Z Autotune Choices Stats: 2025-12-04T10:01:22.2332664Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2333109Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2333447Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2334012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2335193Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2336436Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2337668Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2338841Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2340009Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2341223Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2342397Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2343561Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2344793Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2345999Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2346255Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.2346390Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2346465Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2346527Z unimplemented [] 2025-12-04T10:01:22.2346631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2346834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2348074Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2348181Z graph_break [] 2025-12-04T10:01:22.2348310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2348377Z Autotune Choices Stats: 2025-12-04T10:01:22.2349798Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.2350044Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2350274Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2350588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2351744Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2352943Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2354071Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2355355Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2356489Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2357625Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2357935Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2358009Z Autotune Choices Stats: 2025-12-04T10:01:22.2359466Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2359919Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2360258Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2360822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2362098Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2363326Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2364501Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2365676Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2366881Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2368055Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2369232Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2370405Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2371708Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2372906Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2373163Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.2373297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2373372Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2373438Z unimplemented [] 2025-12-04T10:01:22.2373539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2373735Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2374936Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2375041Z graph_break [] 2025-12-04T10:01:22.2375180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2375258Z Autotune Choices Stats: 2025-12-04T10:01:22.2376676Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2376923Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2377151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2377466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2378610Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2379800Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2380967Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2382112Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2383242Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2384408Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2384653Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.2384725Z Autotune Choices Stats: 2025-12-04T10:01:22.2386182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2386625Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2386958Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2387636Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2388851Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2390058Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2391243Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2392420Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2393626Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2394805Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2395980Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2397209Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2398404Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2399574Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2399828Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.2399959Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2400032Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2400130Z unimplemented [] 2025-12-04T10:01:22.2400231Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2400419Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2401623Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2401690Z graph_break [] 2025-12-04T10:01:22.2401818Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2401885Z Autotune Choices Stats: 2025-12-04T10:01:22.2403309Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.2403559Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2403787Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2404097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2405349Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2406514Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2407652Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2408784Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2409950Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2411090Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2411334Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.2411403Z Autotune Choices Stats: 2025-12-04T10:01:22.2412870Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2413315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2413708Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2414301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2415501Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2416678Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2417860Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2419071Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2420248Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2421424Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2422665Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2423864Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2425103Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.2426501Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2426814Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.2426951Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2427025Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2427089Z unimplemented [] 2025-12-04T10:01:22.2427192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2427446Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2428641Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2428711Z graph_break [] 2025-12-04T10:01:22.2428842Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2428912Z Autotune Choices Stats: 2025-12-04T10:01:22.2430344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2430592Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2430821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2431205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2432395Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2433531Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2434673Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2435856Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2436983Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2438121Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2438372Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2438441Z Autotune Choices Stats: 2025-12-04T10:01:22.2439973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2440418Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2440782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2441339Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2442530Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2443715Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2444980Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2446349Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.2447523Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2448757Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2449935Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2451142Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2452321Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2453492Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2453782Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.2453913Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2453988Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2454050Z unimplemented [] 2025-12-04T10:01:22.2454153Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2454342Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2455676Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2455746Z graph_break [] 2025-12-04T10:01:22.2455879Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2455947Z Autotune Choices Stats: 2025-12-04T10:01:22.2457472Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.2457724Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2457993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2458310Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2459457Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2460590Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2461745Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2462927Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2464061Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2465280Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2465572Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.2465655Z Autotune Choices Stats: 2025-12-04T10:01:22.2467364Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.2467854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.2468193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2468756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2469943Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2471154Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2472334Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2473527Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2474693Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2475941Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2477138Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2478310Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2479496Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2480710Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.2480962Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.2481136Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.2481221Z Traceback (most recent call last): 2025-12-04T10:01:22.2481524Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.2481593Z self.assertTrue( 2025-12-04T10:01:22.2481796Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.2481879Z raise self.failureException(msg) 2025-12-04T10:01:22.2482125Z AssertionError: False is not true : Log file /tmp/tmp5am0ftj2/flex_attention_configs.json was not created 2025-12-04T10:01:22.2482132Z 2025-12-04T10:01:22.2482282Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.2482547Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.2482551Z 2025-12-04T10:01:22.2482725Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.2482865Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2482937Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2483006Z unimplemented [] 2025-12-04T10:01:22.2483181Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2484402Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.2484623Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2484684Z graph_break [] 2025-12-04T10:01:22.2484826Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2485841Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.2485931Z current_size = base.storage().size() 2025-12-04T10:01:22.2486001Z Autotune Choices Stats: 2025-12-04T10:01:22.2487435Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.2487723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2487949Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2488271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T10:01:22.2489424Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2490563Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2491711Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2492909Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2494071Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2495212Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2495470Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.2495536Z Autotune Choices Stats: 2025-12-04T10:01:22.2496993Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, 
"best_triton_pos": 0} 2025-12-04T10:01:22.2497470Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2497811Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2498380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2499573Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2500833Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2502036Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2503203Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2504372Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2505622Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2506788Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2508017Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2509186Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2510421Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2510702Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.2510836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2510906Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2510977Z unimplemented [] 2025-12-04T10:01:22.2511082Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2511271Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2512471Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2512532Z graph_break [] 2025-12-04T10:01:22.2512669Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2512736Z Autotune Choices Stats: 2025-12-04T10:01:22.2514154Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2514439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2514661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2514990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2516140Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2517282Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2518493Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2519654Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2520792Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2521920Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2522205Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.2522276Z 
Autotune Choices Stats: 2025-12-04T10:01:22.2523740Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2524179Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2524516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2525080Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2526279Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.2527529Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2528726Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2529893Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2531074Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2532275Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2533448Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2534622Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2535876Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2537089Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2537341Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.2537477Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2537548Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2537618Z unimplemented [] 2025-12-04T10:01:22.2537721Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2537919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2539116Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2539219Z graph_break [] 2025-12-04T10:01:22.2539353Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2539423Z Autotune Choices Stats: 2025-12-04T10:01:22.2540832Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2541085Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2541307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2541631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2542776Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2543979Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2545147Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2546285Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2547466Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.2548643Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2548897Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.2548965Z Autotune Choices Stats: 2025-12-04T10:01:22.2550438Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2550878Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2551217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2551789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2553046Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2554269Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2555598Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2556774Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2558025Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2559203Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2560374Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.2561639Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2562847Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2564021Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2564270Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.2564413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2564483Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2564554Z unimplemented [] 
2025-12-04T10:01:22.2564658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2564847Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2566059Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2566199Z graph_break [] 2025-12-04T10:01:22.2566344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2566412Z Autotune Choices Stats: 2025-12-04T10:01:22.2567822Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2568075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2568297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2568614Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2569837Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2570979Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2572149Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2573287Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2574422Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2575594Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2575848Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.2575917Z Autotune Choices Stats: 2025-12-04T10:01:22.2577383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2577828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2578165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2578794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2580022Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2581206Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2582389Z 
triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2583596Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2584775Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2585956Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2587132Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2588410Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2589615Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2590797Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2591048Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.2591186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2591289Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2591353Z unimplemented [] 2025-12-04T10:01:22.2591462Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2591650Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2592867Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2592939Z graph_break [] 2025-12-04T10:01:22.2593077Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2593148Z Autotune Choices Stats: 2025-12-04T10:01:22.2594563Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2594819Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2595049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2595371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2596587Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2597773Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2598911Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2600048Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2601212Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2602347Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2602606Z SingleProcess AUTOTUNE benchmarking takes 0.2921 
seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.2602677Z Autotune Choices Stats: 2025-12-04T10:01:22.2604141Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2604583Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2604985Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2605589Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2606768Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2607955Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2609132Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2610341Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2611519Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2612686Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2613927Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2615131Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2616312Z triton_flex_attention_backward_94 0.0142 ms 
79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2617487Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2617772Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.2617911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2617984Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2618048Z unimplemented [] 2025-12-04T10:01:22.2618159Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2618346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2619551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2619616Z graph_break [] 2025-12-04T10:01:22.2619746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2619823Z Autotune Choices Stats: 2025-12-04T10:01:22.2621238Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2621495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2621782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2622149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2623295Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2624434Z 
triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2625571Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2626742Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2627914Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2629056Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2629308Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.2629377Z Autotune Choices Stats: 2025-12-04T10:01:22.2630912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2631402Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2631732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2632300Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2633492Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2634682Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2636132Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2637465Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2638649Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2640132Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2641423Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2642611Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2643787Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2645011Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2645267Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.2645409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2645482Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:22.2645547Z unimplemented [] 2025-12-04T10:01:22.2645658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2645850Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2647071Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2647135Z graph_break [] 2025-12-04T10:01:22.2647268Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2647342Z Autotune Choices Stats: 2025-12-04T10:01:22.2648846Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2649137Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2649364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2649689Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2650843Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2651989Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2653120Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2654307Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2655573Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2656724Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2656981Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.2657051Z Autotune Choices Stats: 2025-12-04T10:01:22.2658622Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.2659121Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2659460Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2660030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2661225Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2662468Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2663652Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2664843Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2666031Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2667356Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2668581Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2669756Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2670931Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2672143Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2672395Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.2672541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2672612Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2672677Z unimplemented [] 2025-12-04T10:01:22.2672792Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2672981Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2674189Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2674253Z graph_break [] 2025-12-04T10:01:22.2674386Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2674472Z Autotune Choices Stats: 2025-12-04T10:01:22.2675962Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.2676254Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2676477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2676803Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2677957Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2679102Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2680273Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2681422Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2682547Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2683750Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.2684036Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.2684105Z Autotune Choices Stats: 2025-12-04T10:01:22.2685561Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2686010Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2686345Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2686913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2688106Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2689323Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2690508Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2691683Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2692921Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2694121Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2695295Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2696474Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:22.2697699Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2698878Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2699127Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.2699263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2699334Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2699397Z unimplemented [] 2025-12-04T10:01:22.2699504Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2699687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2700956Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2701023Z graph_break [] 2025-12-04T10:01:22.2701187Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2701258Z Autotune Choices Stats: 2025-12-04T10:01:22.2702668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.2702922Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2703143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2703469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2704610Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2705800Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2706927Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2708110Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2709246Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2710469Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2711024Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2711094Z Autotune Choices Stats: 2025-12-04T10:01:22.2712565Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2713014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2713345Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2713956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2715140Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2716323Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2717509Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.2718750Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2719958Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2721123Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2722300Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2723524Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2724690Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2725865Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2726112Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.2726249Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2726318Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2726381Z unimplemented [] 2025-12-04T10:01:22.2726493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2726681Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2727951Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2728065Z graph_break [] 2025-12-04T10:01:22.2728193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2728271Z Autotune Choices Stats: 2025-12-04T10:01:22.2729685Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2729941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2730163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2730483Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2731671Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2737440Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2738663Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2739820Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2741051Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2742233Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2742491Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.2742570Z Autotune Choices Stats: 2025-12-04T10:01:22.2744033Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2744486Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2744866Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2745442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2746633Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2747916Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2749155Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2750333Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2751537Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2752702Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2753880Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2755073Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2756453Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2757616Z 
triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2757881Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.2758033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2758231Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2758300Z unimplemented [] 2025-12-04T10:01:22.2758460Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2758655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2759858Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2759932Z graph_break [] 2025-12-04T10:01:22.2760068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2760145Z Autotune Choices Stats: 2025-12-04T10:01:22.2761577Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.2761839Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2762125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2762455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2763608Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2764743Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.2765872Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2767087Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2768260Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2769396Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2769646Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.2769722Z Autotune Choices Stats: 2025-12-04T10:01:22.2771184Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2771668Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2772004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2772576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2773760Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2774941Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2776180Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2777379Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2778550Z 
triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2779712Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2781134Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2782309Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2783480Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2784771Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2785065Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.2785203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2785282Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2785347Z unimplemented [] 2025-12-04T10:01:22.2785457Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2785651Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2786854Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2786930Z graph_break [] 2025-12-04T10:01:22.2787061Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2787137Z Autotune Choices Stats: 2025-12-04T10:01:22.2788626Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2788927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2789149Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2789464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2790614Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2791755Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2792955Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2794089Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2795270Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2796415Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2796665Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.2796780Z Autotune Choices Stats: 2025-12-04T10:01:22.2798255Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2798711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2799047Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2799624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2800810Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2802090Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2803307Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2804495Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2805681Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2806888Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2808075Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2809248Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2810425Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2811660Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2811949Z SingleProcess AUTOTUNE 
benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.2812086Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2812166Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2812233Z unimplemented [] 2025-12-04T10:01:22.2812348Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2812543Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2813748Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2813825Z graph_break [] 2025-12-04T10:01:22.2813957Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2814070Z Autotune Choices Stats: 2025-12-04T10:01:22.2815497Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.2815756Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2815978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2816299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2817454Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2818594Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2819788Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2820959Z 
triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2822100Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2823244Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2823523Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.2823598Z Autotune Choices Stats: 2025-12-04T10:01:22.2825046Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.2825493Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2825823Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2826387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2827679Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2828901Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2830073Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2831251Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2832478Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2833651Z triton_flex_attention_backward_240 0.0142 
ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2834829Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2835992Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2837229Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2838429Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2838683Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.2838814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2838893Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2838957Z unimplemented [] 2025-12-04T10:01:22.2839069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2839262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2840460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2840565Z graph_break [] 2025-12-04T10:01:22.2840700Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2840774Z Autotune Choices Stats: 2025-12-04T10:01:22.2842190Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2842454Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2842683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2843001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2844149Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2845353Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2846535Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2847669Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2848803Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2849983Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2850230Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.2850303Z Autotune Choices Stats: 2025-12-04T10:01:22.2851757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2852208Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2852540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2853112Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2854359Z triton_flex_attention_backward_254 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2855726Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2856900Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2858073Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2859327Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2860504Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2861676Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2862934Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2864169Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2865343Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2865594Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.2865772Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.2865862Z Traceback (most recent call last): 2025-12-04T10:01:22.2866170Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.2866284Z self.assertTrue( 2025-12-04T10:01:22.2866488Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 
2025-12-04T10:01:22.2866572Z raise self.failureException(msg)
2025-12-04T10:01:22.2866827Z AssertionError: False is not true : Log file /tmp/tmplcc40_74/flex_attention_configs.json was not created
2025-12-04T10:01:22.2866834Z
2025-12-04T10:01:22.2866971Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:22.2867272Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:22.2867284Z
2025-12-04T10:01:22.2867454Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:22.2867590Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:22.2867670Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:22.2867736Z unimplemented []
2025-12-04T10:01:22.2867843Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:22.2869069Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:22.2869262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:22.2869330Z graph_break []
2025-12-04T10:01:22.2869466Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:22.2870472Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class.
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.2870562Z current_size = base.storage().size() 2025-12-04T10:01:22.2870704Z Autotune Choices Stats: 2025-12-04T10:01:22.2872138Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.2872427Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2872657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2872978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2874126Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2875250Z triton_flex_attention_5 0.0102 ms 
91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2876412Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2877537Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2878667Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:22.2879861Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2880142Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.2880210Z Autotune Choices Stats: 2025-12-04T10:01:22.2881656Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.2882107Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2882444Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2883006Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2884223Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2885385Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2886557Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2887719Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2888942Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2890141Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2891311Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.2892484Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2893695Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2894860Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2895127Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.2895258Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2895332Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2895400Z unimplemented [] 
2025-12-04T10:01:22.2895505Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2895698Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2896982Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2897095Z graph_break [] 2025-12-04T10:01:22.2897226Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2897294Z Autotune Choices Stats: 2025-12-04T10:01:22.2898706Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2898953Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2899182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2899497Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2900645Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2901808Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2902945Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2904074Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2905272Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2906445Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2906692Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.2906762Z Autotune Choices Stats: 2025-12-04T10:01:22.2908262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2908714Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2909053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2909660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2910840Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2912013Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2913176Z 
triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2914410Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2915598Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2916764Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2917925Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2919131Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2920297Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2921458Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2921714Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.2921845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2921917Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2921991Z unimplemented [] 2025-12-04T10:01:22.2922096Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2922413Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2923620Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2923718Z graph_break [] 2025-12-04T10:01:22.2923859Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2923928Z Autotune Choices Stats: 2025-12-04T10:01:22.2925357Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2925606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2925832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2926189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2927345Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2928476Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2929613Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2930749Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2931965Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2933131Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2933394Z SingleProcess AUTOTUNE benchmarking takes 0.2908 
seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.2933464Z Autotune Choices Stats: 2025-12-04T10:01:22.2934912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2935402Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2935738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2936304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2937491Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2938670Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2939910Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2941116Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2942275Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2943444Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2944642Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2945806Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2946980Z triton_flex_attention_backward_56 0.0142 ms 
79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2948188Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2948441Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.2948638Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2948741Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.2948810Z unimplemented [] 2025-12-04T10:01:22.2948911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2949098Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2950297Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2950360Z graph_break [] 2025-12-04T10:01:22.2950498Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2950566Z Autotune Choices Stats: 2025-12-04T10:01:22.2951984Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2952268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2952495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2952815Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2953961Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2955106Z 
triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2956386Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2957623Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2958800Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2959933Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2960185Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.2960252Z Autotune Choices Stats: 2025-12-04T10:01:22.2961701Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2962210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2962549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2963106Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2964292Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2965471Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2966709Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2967916Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2969081Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2970259Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2971460Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2972627Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2973797Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2975020Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.2975311Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.2975442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.2975511Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:22.2975580Z unimplemented [] 2025-12-04T10:01:22.2975682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.2975867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.2977071Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.2977134Z graph_break [] 2025-12-04T10:01:22.2977271Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.2977339Z Autotune Choices Stats: 2025-12-04T10:01:22.2978753Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.2979041Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2979267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2979581Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2980722Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2981858Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2983060Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.2984224Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.2985360Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2986494Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2986752Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.2986863Z Autotune Choices Stats: 2025-12-04T10:01:22.2988374Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.2988820Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.2989165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.2989726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.2990911Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2992151Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:22.2993367Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2994546Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.2995715Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.2996919Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.2998094Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.2999272Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3000499Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3001689Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3001939Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.3002072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3002142Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3002212Z unimplemented [] 2025-12-04T10:01:22.3002316Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3002507Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3003704Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3003765Z graph_break [] 2025-12-04T10:01:22.3003935Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3004002Z Autotune Choices Stats: 2025-12-04T10:01:22.3005426Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3005672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3005893Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3006214Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3007355Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3008484Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3009684Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3010848Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3011977Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3013107Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3013395Z 
SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.3013466Z Autotune Choices Stats: 2025-12-04T10:01:22.3014917Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3015369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3015707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3016266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3017518Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3018718Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3019908Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3021087Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3022290Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3023467Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3024631Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3025806Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.3027062Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3028299Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3028553Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.3028686Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3028755Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3028827Z unimplemented [] 2025-12-04T10:01:22.3028929Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3029115Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3030316Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3030417Z graph_break [] 2025-12-04T10:01:22.3030567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3030638Z Autotune Choices Stats: 2025-12-04T10:01:22.3032073Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3032319Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3032546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3032863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3034005Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3035196Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3036354Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3037490Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3038623Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3039793Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3040045Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.3040112Z Autotune Choices Stats: 2025-12-04T10:01:22.3041570Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3042013Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3042352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3042980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3044168Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3045385Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3046559Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3047735Z 
triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3048942Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3050115Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3051288Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3052520Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3053733Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3054904Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3055167Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.3055476Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.3055547Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3055618Z unimplemented [] 2025-12-04T10:01:22.3055827Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3056017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3057216Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3057280Z graph_break [] 2025-12-04T10:01:22.3057417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3057483Z Autotune Choices Stats: 2025-12-04T10:01:22.3058899Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.3059151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3059375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3059692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3060944Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3062141Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3063292Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3064425Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3065601Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3066730Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3066982Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.3067049Z Autotune Choices Stats: 2025-12-04T10:01:22.3068562Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3069003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3069409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3070002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3071197Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3072372Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3073542Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3074754Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3075928Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3077105Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3078330Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3079542Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3080729Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3081908Z 
triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3082199Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.3082331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3082402Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3082470Z unimplemented [] 2025-12-04T10:01:22.3082573Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3082758Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3083961Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3084023Z graph_break [] 2025-12-04T10:01:22.3084156Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3084223Z Autotune Choices Stats: 2025-12-04T10:01:22.3085642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.3085892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3086121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3086502Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3087668Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3088793Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.3089925Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3091051Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3092237Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3093364Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3093617Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.3093684Z Autotune Choices Stats: 2025-12-04T10:01:22.3095204Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3095644Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3096023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3096582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3097772Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3098940Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3100147Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3101318Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3102484Z 
triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3103655Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3104880Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3106092Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3107297Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3108477Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3108767Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.3108899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3108968Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3109037Z unimplemented [] 2025-12-04T10:01:22.3109139Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3109324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3110539Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3110604Z graph_break [] 2025-12-04T10:01:22.3110738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3110805Z Autotune Choices Stats: 2025-12-04T10:01:22.3112221Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3112536Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3112799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3113113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3114261Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3115411Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3116548Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3117710Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3118840Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3119972Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3120224Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.3120295Z Autotune Choices Stats: 2025-12-04T10:01:22.3121816Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3122307Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3122645Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3123206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3124390Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3125620Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3126792Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3127962Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3129133Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3130368Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3131560Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3132732Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3133897Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3135134Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3135430Z SingleProcess AUTOTUNE 
benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.3135583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3135664Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3135758Z unimplemented [] 2025-12-04T10:01:22.3135880Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3136115Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3137497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3137570Z graph_break [] 2025-12-04T10:01:22.3137703Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3137772Z Autotune Choices Stats: 2025-12-04T10:01:22.3139246Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.3139527Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3139746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3140065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3141210Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3142346Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3143518Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3144650Z 
triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3145965Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3147288Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3147600Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.3147699Z Autotune Choices Stats: 2025-12-04T10:01:22.3149153Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3149594Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3149929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3150488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3151683Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3152890Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3154066Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3155350Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3156667Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3157896Z triton_flex_attention_backward_200 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3159075Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3160249Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3161468Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3162645Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3162891Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.3163032Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3163103Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3163172Z unimplemented [] 2025-12-04T10:01:22.3163273Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3163456Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3164662Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3164726Z graph_break [] 2025-12-04T10:01:22.3164930Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3165034Z Autotune Choices Stats: 2025-12-04T10:01:22.3166454Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3166700Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3166926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3167244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3168386Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3169560Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3170696Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3171825Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3172956Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3174157Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3174444Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.3174510Z Autotune Choices Stats: 2025-12-04T10:01:22.3175975Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3176418Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3176768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3177325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3178558Z triton_flex_attention_backward_215 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3179730Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3180914Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3182150Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3183334Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3184542Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3185723Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3186898Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3188162Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3189346Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3189594Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.3189730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3189803Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3189874Z unimplemented [] 2025-12-04T10:01:22.3189977Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3190161Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3191457Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3191600Z graph_break [] 2025-12-04T10:01:22.3191737Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3191805Z Autotune Choices Stats: 2025-12-04T10:01:22.3193224Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.3193478Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3193697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3194016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3195157Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3196327Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3197458Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3198595Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3199796Z triton_flex_attention_231 
0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3200967Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3201216Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.3201300Z Autotune Choices Stats: 2025-12-04T10:01:22.3202750Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3203192Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3203569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3204127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3205317Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3206490Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3207665Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3208921Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3210130Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3211311Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3212477Z triton_flex_attention_backward_238 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3213685Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3214853Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3216032Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3216277Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.3216413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3216483Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3216551Z unimplemented [] 2025-12-04T10:01:22.3216717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3216904Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3218135Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3218199Z graph_break [] 2025-12-04T10:01:22.3218349Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3218418Z Autotune Choices Stats: 2025-12-04T10:01:22.3219830Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3220082Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3220303Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3220678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3221818Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3222953Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3224079Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3225278Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3226412Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3227617Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3227870Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.3227939Z Autotune Choices Stats: 2025-12-04T10:01:22.3229394Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3229874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3230212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3230773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3231961Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3233134Z triton_flex_attention_backward_255 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3234371Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3235576Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3236751Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3237926Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3239129Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3240304Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3241475Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3242653Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3242963Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.3243134Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3243205Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3243274Z unimplemented [] 2025-12-04T10:01:22.3243378Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3243565Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3244770Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3244837Z graph_break [] 2025-12-04T10:01:22.3244973Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3245040Z Autotune Choices Stats: 2025-12-04T10:01:22.3246455Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3246746Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3246964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3247285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3248425Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3249562Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3250695Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3251892Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3253082Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3254245Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3254502Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.3254569Z Autotune Choices Stats: 2025-12-04T10:01:22.3256209Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3256719Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3257059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3257623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3258825Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3260109Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3261326Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3262505Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3263682Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3264892Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3266063Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3267292Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3268466Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3269713Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3270002Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.3270183Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.3270264Z Traceback (most recent call last): 2025-12-04T10:01:22.3270571Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 
2025-12-04T10:01:22.3270638Z self.assertTrue( 2025-12-04T10:01:22.3270839Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.3270929Z raise self.failureException(msg) 2025-12-04T10:01:22.3271173Z AssertionError: False is not true : Log file /tmp/tmpm3w3a3p6/flex_attention_configs.json was not created 2025-12-04T10:01:22.3271179Z 2025-12-04T10:01:22.3271316Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.3271576Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.3271579Z 2025-12-04T10:01:22.3271744Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.3271883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3271991Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3272056Z unimplemented [] 2025-12-04T10:01:22.3272167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3273378Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.3273570Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3273630Z graph_break [] 2025-12-04T10:01:22.3273763Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3274778Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage 
is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.3274861Z current_size = base.storage().size() 2025-12-04T10:01:22.3274936Z Autotune Choices Stats: 2025-12-04T10:01:22.3276363Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.3276619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3276911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3277262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3278405Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3279540Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3280668Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3281825Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3282954Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3284093Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3284342Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.3284419Z Autotune Choices Stats: 2025-12-04T10:01:22.3285971Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.3286457Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3286791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3287360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3288552Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3289734Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3290940Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3292114Z triton_flex_attention_backward_9 0.0143 
ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3293286Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3294510Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3295720Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3296882Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3298052Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3299247Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3299496Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.3299632Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.3299710Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3299775Z unimplemented [] 2025-12-04T10:01:22.3299885Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3300074Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3301279Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3301351Z graph_break [] 2025-12-04T10:01:22.3301481Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3301556Z Autotune Choices Stats: 2025-12-04T10:01:22.3303032Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3303318Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3303540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3303857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3305002Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3306140Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3307327Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3308509Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3309637Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3310775Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3311023Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.3311101Z Autotune Choices Stats: 2025-12-04T10:01:22.3312605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3313085Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3313418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3313989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3315173Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3316385Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3317556Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3318728Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3319897Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3321144Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3322344Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3323512Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3324690Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3325892Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3326139Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.3326280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3326356Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3326418Z unimplemented [] 2025-12-04T10:01:22.3326527Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3326713Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3327910Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3327982Z graph_break [] 2025-12-04T10:01:22.3328110Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3328179Z Autotune Choices Stats: 2025-12-04T10:01:22.3329676Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3329960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3330175Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3330490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3331632Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3332761Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3333936Z triton_flex_attention_40 0.0102 ms 80.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3335074Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3336200Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3337397Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.3337673Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.3337747Z Autotune Choices Stats: 2025-12-04T10:01:22.3339202Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3339650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3339982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3340550Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3341729Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3342940Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3344115Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3345283Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3346534Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3347773Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3348952Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3350114Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:22.3351365Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3352536Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3352800Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.3352934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3353014Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3353077Z unimplemented [] 2025-12-04T10:01:22.3353180Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3353372Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3354642Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3354714Z graph_break [] 2025-12-04T10:01:22.3354876Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3354958Z Autotune Choices Stats: 2025-12-04T10:01:22.3356766Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3357067Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3357326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3357650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3358802Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3360011Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3361153Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3362297Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3363424Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3364653Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3364985Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.3365073Z Autotune Choices Stats: 2025-12-04T10:01:22.3366797Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3367243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3367575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3368183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3369367Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3370546Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3371713Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.3372957Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3374167Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3375338Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3376513Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3377714Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3378891Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3380064Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3380330Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.3380462Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3380538Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3380600Z unimplemented [] 2025-12-04T10:01:22.3380706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3380896Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3382165Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3382285Z graph_break [] 2025-12-04T10:01:22.3382413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3382478Z Autotune Choices Stats: 2025-12-04T10:01:22.3383900Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3384153Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3384374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3384688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3385893Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3387019Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3388247Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3389378Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3390578Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3391757Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3392005Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.3392078Z Autotune Choices Stats: 2025-12-04T10:01:22.3393530Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3393975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3394347Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3394915Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3396111Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3397286Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3398459Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3399704Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3400902Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3402073Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3403255Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3404457Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3405644Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3406812Z 
triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3407065Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.3407193Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3407272Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3407403Z unimplemented [] 2025-12-04T10:01:22.3407508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3407732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3408925Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3408994Z graph_break [] 2025-12-04T10:01:22.3409122Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3409190Z Autotune Choices Stats: 2025-12-04T10:01:22.3410604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3410852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3411114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3411431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3412584Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3413719Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.3414853Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3416074Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3417232Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3418368Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3418617Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.3418692Z Autotune Choices Stats: 2025-12-04T10:01:22.3420149Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3420636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3420972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3421545Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3422730Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3423911Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3425149Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3426357Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3427606Z triton_flex_attention_backward_106 0.0123 
ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3428794Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3430000Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3431164Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3432342Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3433567Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3433852Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.3433983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3434060Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3434123Z unimplemented [] 2025-12-04T10:01:22.3434230Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3434423Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3435631Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3435705Z graph_break [] 2025-12-04T10:01:22.3435834Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3435902Z Autotune Choices Stats: 2025-12-04T10:01:22.3437321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3437606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3437833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3438149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3439298Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3440428Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3441567Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3442772Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3443935Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3445068Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3445313Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.3445385Z Autotune Choices Stats: 2025-12-04T10:01:22.3446878Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3447324Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3447658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3448224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3449412Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3450676Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3451873Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3453049Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3454231Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3455648Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3456823Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3457997Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3459171Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3460442Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3460743Z SingleProcess AUTOTUNE benchmarking takes 
0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.3460876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3460953Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3461017Z unimplemented [] 2025-12-04T10:01:22.3461119Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3461313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3462511Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3462579Z graph_break [] 2025-12-04T10:01:22.3462708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3462777Z Autotune Choices Stats: 2025-12-04T10:01:22.3464269Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.3464515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3464740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3465058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3466215Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3467390Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3468628Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3469791Z triton_flex_attention_136 0.0113 ms 
81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3470929Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3472075Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3472362Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.3472436Z Autotune Choices Stats: 2025-12-04T10:01:22.3473887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3474337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3474670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3475231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3476495Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3477676Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3478878Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3480052Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3481222Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3482450Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3483625Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3484797Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3486038Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3487233Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3487491Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.3487625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3487705Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3487769Z unimplemented [] 2025-12-04T10:01:22.3487871Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3488070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3489268Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3489379Z graph_break [] 2025-12-04T10:01:22.3489516Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3489584Z Autotune Choices Stats: 2025-12-04T10:01:22.3491002Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.3491248Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3491480Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3491798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3492957Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3494155Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3495330Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3496473Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3497616Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3498784Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3499034Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.3499102Z Autotune Choices Stats: 2025-12-04T10:01:22.3500560Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3501014Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3501344Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3501909Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3503168Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3504379Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3505561Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3506740Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3507991Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3509168Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3510350Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3511591Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3512811Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3513987Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3514240Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.3514368Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3514444Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3514507Z unimplemented [] 2025-12-04T10:01:22.3514607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3514797Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3516044Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3516110Z graph_break [] 2025-12-04T10:01:22.3516237Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3516304Z Autotune Choices Stats: 2025-12-04T10:01:22.3517725Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3517976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3518203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3518517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3519736Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3520905Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3522041Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3523181Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3524313Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3525498Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3525754Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.3525823Z Autotune Choices Stats: 2025-12-04T10:01:22.3527287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3527732Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3528133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3528696Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3529923Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3531103Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3532275Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3533484Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3534660Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3535842Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3537015Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3538249Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3539463Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3540638Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3540892Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.3541058Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3541133Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3541197Z unimplemented [] 2025-12-04T10:01:22.3541304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3541494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3542696Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3542765Z graph_break [] 2025-12-04T10:01:22.3542895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3542962Z Autotune Choices Stats: 2025-12-04T10:01:22.3544387Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 
2025-12-04T10:01:22.3544634Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3544861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3545210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3546681Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3548075Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3549225Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3550358Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3551523Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3552659Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3552907Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.3552975Z Autotune Choices Stats: 2025-12-04T10:01:22.3554429Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3554965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3556118Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3556798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3558012Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3559189Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3560445Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3561619Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3562794Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.3563976Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3565232Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3566463Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3567645Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3568820Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3569115Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.3569250Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3569328Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3569392Z unimplemented [] 2025-12-04T10:01:22.3569495Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3569688Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3570899Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3570968Z graph_break [] 2025-12-04T10:01:22.3571103Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.3571169Z Autotune Choices Stats: 2025-12-04T10:01:22.3572590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3572902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3573132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3573477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3574622Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3575757Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3576890Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3578055Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3579186Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3580321Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3580569Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.3580637Z Autotune Choices Stats: 2025-12-04T10:01:22.3582182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3582658Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3582994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3583560Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.3584756Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3585936Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3587153Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3588380Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3589556Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3590804Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3592005Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3593182Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3594357Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3595560Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3595817Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.3595948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3596019Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3596089Z unimplemented [] 2025-12-04T10:01:22.3596192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3596384Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3597585Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3597655Z graph_break [] 2025-12-04T10:01:22.3597784Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3597852Z Autotune Choices Stats: 2025-12-04T10:01:22.3599344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.3599623Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3599852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3600165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3601320Z triton_flex_attention_232 0.0092 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3602453Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3603645Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3604773Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.3605911Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3607049Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3607295Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.3607362Z Autotune Choices Stats: 2025-12-04T10:01:22.3608889Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3609375Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3609720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3610287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3611488Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3612713Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3613901Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3615071Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3616312Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3617625Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.3618793Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3619973Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3621142Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3622344Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3622599Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.3622731Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3622805Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3622882Z unimplemented [] 2025-12-04T10:01:22.3622992Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3623188Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3624387Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3624450Z graph_break [] 2025-12-04T10:01:22.3624588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3624655Z Autotune Choices Stats: 2025-12-04T10:01:22.3626153Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3626446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3626675Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3626995Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3628200Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3629337Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3630519Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3631647Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3632788Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3633983Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3634269Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.3634339Z Autotune Choices 
Stats: 2025-12-04T10:01:22.3635803Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3636248Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3636585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3637147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3638381Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:22.3639556Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3640754Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3641932Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3643168Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3644390Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3645570Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3646751Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3647952Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3649114Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3649366Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.3649498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3649567Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3649637Z unimplemented [] 2025-12-04T10:01:22.3649738Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3649933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3651198Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3651304Z graph_break [] 2025-12-04T10:01:22.3651441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3651510Z Autotune Choices Stats: 2025-12-04T10:01:22.3652931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3653177Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3653404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3653721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3654866Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3656204Z triton_flex_attention_271 0.0091 ms 89.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3657344Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3658483Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3659721Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:22.3660855Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3661152Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.3661221Z Autotune Choices Stats: 2025-12-04T10:01:22.3662688Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3663126Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3663462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3664074Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3665427Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3666599Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3667827Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3669070Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3670286Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3671463Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3672635Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.3673841Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3675029Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3676207Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3676459Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.3676592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3676665Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3676735Z 
unimplemented [] 2025-12-04T10:01:22.3676839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3677107Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3678311Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3678408Z graph_break [] 2025-12-04T10:01:22.3678546Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3678616Z Autotune Choices Stats: 2025-12-04T10:01:22.3680029Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3680274Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3680506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3680862Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3682017Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3683142Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3684284Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3685426Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3686631Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3687790Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3688047Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.3688117Z Autotune Choices Stats: 2025-12-04T10:01:22.3689570Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3690055Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3690399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3690961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3692156Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3693326Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.3694563Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3695769Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3696939Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3698112Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3699327Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3700508Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3701677Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3702838Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3703154Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.3703333Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.3703445Z Traceback (most recent call last): 2025-12-04T10:01:22.3703756Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.3703822Z self.assertTrue( 2025-12-04T10:01:22.3704019Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.3704107Z raise self.failureException(msg) 2025-12-04T10:01:22.3704354Z AssertionError: False is not true : Log file /tmp/tmpxg4d7rlo/flex_attention_configs.json was not created 2025-12-04T10:01:22.3704363Z 2025-12-04T10:01:22.3704511Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.3704766Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.3704769Z 2025-12-04T10:01:22.3704933Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.3705072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3705142Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3705210Z unimplemented [] 2025-12-04T10:01:22.3705312Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3706524Z inductor [('triton_bundler_save_kernel', 232), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.3706758Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3706822Z graph_break [] 2025-12-04T10:01:22.3706959Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3708024Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.3708111Z current_size = base.storage().size() 2025-12-04T10:01:22.3708182Z Autotune Choices Stats: 2025-12-04T10:01:22.3709602Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.3709862Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3710088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3710414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3711643Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3712809Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3713944Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3715084Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3716257Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3717390Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3717643Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.3717713Z Autotune Choices Stats: 2025-12-04T10:01:22.3719166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.3719611Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3720009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3720625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3721808Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3722992Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3724165Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3725361Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3726540Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3727713Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3728953Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3730164Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3731342Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3732517Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3732795Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.3732932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3733007Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3733070Z unimplemented [] 2025-12-04T10:01:22.3733183Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3733370Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3734574Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3734637Z graph_break [] 2025-12-04T10:01:22.3734769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3734844Z Autotune Choices Stats: 2025-12-04T10:01:22.3742707Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3743041Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3743278Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3743712Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3744920Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3746068Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3747198Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3748436Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3749602Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3750749Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3751009Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.3751083Z Autotune Choices Stats: 2025-12-04T10:01:22.3752604Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3753061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3753430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3754002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3755373Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3756599Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3757861Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3759027Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3760203Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3761369Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3762636Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3763868Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3765032Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3766208Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3766502Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.3766649Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3766722Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3766787Z unimplemented [] 2025-12-04T10:01:22.3766903Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3767096Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3768304Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3768373Z graph_break [] 2025-12-04T10:01:22.3768516Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3768593Z Autotune Choices Stats: 2025-12-04T10:01:22.3770102Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3770369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3770626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3770953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3772108Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3773247Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3774386Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3775554Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3776680Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3777811Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3778070Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.3778145Z Autotune Choices Stats: 2025-12-04T10:01:22.3779675Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3780166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3780500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3781078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3782254Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3783471Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3784638Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3785812Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3786985Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3788291Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3789506Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3790676Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3791834Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3793041Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3793296Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.3793439Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3793512Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3793577Z unimplemented [] 2025-12-04T10:01:22.3793688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3793883Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3795107Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3795172Z graph_break [] 2025-12-04T10:01:22.3795306Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3795382Z Autotune Choices Stats: 2025-12-04T10:01:22.3796878Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3797169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3797393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3797721Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3798864Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3799992Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3801163Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3802303Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3803429Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.3804562Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3804874Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.3804951Z Autotune Choices Stats: 2025-12-04T10:01:22.3806432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3806885Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3807218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3807790Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3808966Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3810180Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3811362Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3812561Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3813808Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3815021Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3816209Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.3817379Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3818586Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3819767Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3820022Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.3820168Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3820245Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3820310Z unimplemented [] 
2025-12-04T10:01:22.3820421Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3820608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3821821Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3821884Z graph_break [] 2025-12-04T10:01:22.3822092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3822176Z Autotune Choices Stats: 2025-12-04T10:01:22.3823627Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3823887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3824113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3824434Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3825589Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3826763Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3827944Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3829088Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3830211Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3831431Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3831734Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.3831811Z Autotune Choices Stats: 2025-12-04T10:01:22.3833269Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3833716Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3834048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3834616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3835845Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3837022Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3838192Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3839428Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3840604Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3841808Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3842980Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3844143Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3845348Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3846523Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3846773Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.3846911Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3846981Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3847044Z unimplemented [] 2025-12-04T10:01:22.3847154Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3847341Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3848608Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3848707Z graph_break [] 2025-12-04T10:01:22.3848838Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3848911Z Autotune Choices Stats: 2025-12-04T10:01:22.3850325Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3850618Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3850991Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3851330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3852478Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3853658Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3854785Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3856051Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3857298Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3858483Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3858729Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.3858806Z Autotune Choices Stats: 2025-12-04T10:01:22.3860265Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3860712Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3861120Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3861688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3862874Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3864057Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3865226Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3866469Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3867724Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3868902Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3870088Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3871301Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3872478Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3873659Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3873911Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.3874045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3874124Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3874187Z unimplemented [] 2025-12-04T10:01:22.3874297Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3874549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3875784Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3875854Z graph_break [] 2025-12-04T10:01:22.3875987Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3876067Z Autotune Choices Stats: 2025-12-04T10:01:22.3877493Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3877751Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3877974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3878328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3879474Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.3880606Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3881744Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3882885Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3884086Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3885264Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3885510Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.3885588Z Autotune Choices Stats: 2025-12-04T10:01:22.3887041Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.3887524Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3887858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.3888428Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3889611Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3890788Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3892021Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3893247Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3894422Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3895589Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3896810Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3897980Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3899159Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3900330Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3900644Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.3900809Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.3900883Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3900946Z unimplemented [] 2025-12-04T10:01:22.3901058Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3901248Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3902447Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3902517Z graph_break [] 2025-12-04T10:01:22.3902653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3902727Z Autotune Choices Stats: 2025-12-04T10:01:22.3904142Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.3904431Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3904657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3904979Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3906128Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3907320Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3908462Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3909669Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3910828Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3911963Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3912210Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.3912285Z Autotune Choices Stats: 2025-12-04T10:01:22.3913741Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3914226Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3914556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3915126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3916307Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3917548Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3918746Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3919926Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3921102Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3922266Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3923470Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3924637Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3925808Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3927067Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3927352Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.3927487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3927566Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3927631Z unimplemented [] 2025-12-04T10:01:22.3927744Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3927932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3929134Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3929206Z graph_break [] 2025-12-04T10:01:22.3929336Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3929410Z Autotune Choices Stats: 2025-12-04T10:01:22.3930816Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.3931112Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3931334Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3931653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3932809Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3933940Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3935150Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3936322Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3937453Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3938588Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3938867Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.3938941Z Autotune Choices Stats: 2025-12-04T10:01:22.3940397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3940844Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3941181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3941752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3942933Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3944190Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3945396Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3946580Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3947804Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3949013Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3950195Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3951363Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3952603Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3953808Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3954060Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.3954191Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3954270Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3954334Z unimplemented [] 2025-12-04T10:01:22.3954437Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3954642Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3956000Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3956148Z graph_break [] 2025-12-04T10:01:22.3956282Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3956353Z Autotune Choices Stats: 2025-12-04T10:01:22.3957780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.3958039Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3958262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3958583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3959731Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3960969Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3962167Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3963310Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3964435Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3965580Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3965863Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.3965937Z Autotune Choices Stats: 2025-12-04T10:01:22.3967399Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3967847Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3968182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3968750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3970033Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3971253Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3972438Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3973620Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3974847Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3976239Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3977600Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.3978832Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.3980003Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3981204Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.3981462Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.3981595Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.3981671Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.3981734Z unimplemented [] 2025-12-04T10:01:22.3981838Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.3982032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.3983234Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.3983342Z graph_break [] 2025-12-04T10:01:22.3983471Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.3983545Z Autotune Choices Stats: 2025-12-04T10:01:22.3984972Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.3985239Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3985469Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3985785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3986940Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3988175Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3989362Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.3990500Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.3991626Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3992812Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3993068Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.3993143Z Autotune Choices Stats: 2025-12-04T10:01:22.3994605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.3995103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.3995504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.3996261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.3997531Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.3998717Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.3999896Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4001105Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4002281Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4003459Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4004638Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4005883Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4007091Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.4008263Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4008527Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.4008663Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4008774Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4008842Z unimplemented [] 2025-12-04T10:01:22.4008945Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4009141Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4010340Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4010409Z graph_break [] 2025-12-04T10:01:22.4010538Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4010608Z Autotune Choices Stats: 2025-12-04T10:01:22.4012035Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4012293Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4012516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4012831Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4014037Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4015203Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4016348Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4017486Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4018653Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4019784Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4020032Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.4020103Z Autotune Choices Stats: 2025-12-04T10:01:22.4021559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4022002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4022401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4023020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4024202Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4025380Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4026551Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4027853Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.4029027Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4030206Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4031458Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4032651Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4033826Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4035004Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4035299Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.4035434Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4035510Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4035575Z unimplemented [] 2025-12-04T10:01:22.4035677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4035891Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4037087Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4037156Z graph_break [] 2025-12-04T10:01:22.4037289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4037358Z Autotune Choices Stats: 2025-12-04T10:01:22.4038780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.4039029Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4039322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4039673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4040825Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4041962Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4043101Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4044269Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4045398Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4046536Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4046782Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.4046854Z Autotune Choices Stats: 2025-12-04T10:01:22.4048380Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4048867Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.4049198Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4049764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4050954Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4052132Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4053338Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4054520Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4055876Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4057183Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4058409Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4059584Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4060755Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4061979Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.4062232Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.4062366Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4062442Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4062508Z unimplemented [] 2025-12-04T10:01:22.4062611Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4062808Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4064008Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4064080Z graph_break [] 2025-12-04T10:01:22.4064209Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4064276Z Autotune Choices Stats: 2025-12-04T10:01:22.4065773Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4066053Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4066277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4066594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4067800Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4068932Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4070081Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4071267Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4072405Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4073542Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4073792Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.4073865Z Autotune Choices Stats: 2025-12-04T10:01:22.4075389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4075973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4076311Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4076875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4078064Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4079290Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4080471Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4081638Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4082806Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.4084043Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4085257Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4086426Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4087604Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4088811Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4089060Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.4089191Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4089264Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4089330Z unimplemented [] 2025-12-04T10:01:22.4089434Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4089630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4090833Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4090898Z graph_break [] 2025-12-04T10:01:22.4091031Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.4091097Z Autotune Choices Stats: 2025-12-04T10:01:22.4092604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4092879Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4093102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4093419Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4094572Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4095703Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4096873Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4098011Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4099159Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4100366Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4100653Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.4100725Z Autotune Choices Stats: 2025-12-04T10:01:22.4102187Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4102637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4102973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4103533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.4104735Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4106168Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4107482Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4108663Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4109927Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4111138Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4112318Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4113493Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4114697Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4115868Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4116124Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.4116256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4116334Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4116398Z unimplemented [] 2025-12-04T10:01:22.4116501Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4116699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4117971Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4118042Z graph_break [] 2025-12-04T10:01:22.4118217Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4118288Z Autotune Choices Stats: 2025-12-04T10:01:22.4119703Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4119956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4120184Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4120499Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4121646Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4122831Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4123968Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4125104Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.4126258Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4127470Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4127752Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.4127821Z Autotune Choices Stats: 2025-12-04T10:01:22.4129280Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4129727Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4130062Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4130662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4131851Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4133034Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4134214Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4135473Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4136681Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4137870Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4139053Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4140260Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4141434Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4142612Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4142869Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.4143000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4143076Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4143141Z unimplemented [] 2025-12-04T10:01:22.4143246Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4143437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4144703Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4144804Z graph_break [] 2025-12-04T10:01:22.4144933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4145000Z Autotune Choices Stats: 2025-12-04T10:01:22.4146426Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4146672Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4146899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4147264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4148466Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4149597Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4150736Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4151871Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4153076Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4154266Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4154514Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.4154580Z Autotune Choices Stats: 
2025-12-04T10:01:22.4156346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4156804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4157226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4157792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4158990Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.4160178Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4161452Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4162634Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4163851Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4165033Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4166216Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4167424Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4168601Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:22.4169777Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:22.4170033Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices
2025-12-04T10:01:22.4170211Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:22.4170360Z Traceback (most recent call last):
2025-12-04T10:01:22.4170670Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:22.4170771Z self.assertTrue(
2025-12-04T10:01:22.4170980Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:22.4171064Z raise self.failureException(msg)
2025-12-04T10:01:22.4171311Z AssertionError: False is not true : Log file /tmp/tmp51t4iifl/flex_attention_configs.json was not created
2025-12-04T10:01:22.4171316Z
2025-12-04T10:01:22.4171459Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:22.4171719Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:22.4171724Z 2025-12-04T10:01:22.4171894Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.4172031Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4172102Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4172175Z unimplemented [] 2025-12-04T10:01:22.4172282Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4173502Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.4173730Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4173802Z graph_break [] 2025-12-04T10:01:22.4173947Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4174952Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.4175042Z current_size = base.storage().size() 2025-12-04T10:01:22.4175112Z Autotune Choices Stats: 2025-12-04T10:01:22.4176543Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.4176798Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4177022Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4177350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4178586Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4179764Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4180910Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4182048Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4183225Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4184364Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4184623Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.4184692Z Autotune Choices Stats: 2025-12-04T10:01:22.4186166Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.4186615Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4187050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4187731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4188970Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4190152Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4191335Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4192550Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4193732Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4194921Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4196166Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4197352Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4198554Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4199736Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4199987Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.4200164Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4200236Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4200306Z unimplemented [] 2025-12-04T10:01:22.4200413Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4200602Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4201820Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4201885Z graph_break [] 2025-12-04T10:01:22.4202021Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4202090Z Autotune Choices Stats: 2025-12-04T10:01:22.4203519Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4203771Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4203994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4204320Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4205544Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4206723Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4207871Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4209012Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4210182Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4211316Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4211569Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.4211637Z Autotune Choices Stats: 2025-12-04T10:01:22.4213114Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4213620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4214027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4214593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4215782Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4216965Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4218187Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4219368Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4220550Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4221727Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4222979Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4224190Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4225359Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4226541Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4226825Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.4226964Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4227034Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4227101Z unimplemented [] 2025-12-04T10:01:22.4227203Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4227450Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4228655Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4228719Z graph_break [] 2025-12-04T10:01:22.4228855Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4228925Z Autotune Choices Stats: 2025-12-04T10:01:22.4230343Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4230667Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4230894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4231252Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4232397Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4233538Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4234674Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4235847Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4236986Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4238126Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4238378Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.4238449Z Autotune Choices Stats: 2025-12-04T10:01:22.4239981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4240455Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4240794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4241366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4242555Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4243735Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4245009Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4246193Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4247373Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4248617Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4249820Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4251002Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4252172Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4253396Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4253646Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.4253789Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4253861Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4253933Z unimplemented [] 2025-12-04T10:01:22.4254042Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4254229Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4255614Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4255683Z graph_break [] 2025-12-04T10:01:22.4255823Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4255895Z Autotune Choices Stats: 2025-12-04T10:01:22.4257584Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4257898Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4258125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4258453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4259608Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4260759Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4261951Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4263096Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4264230Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4265369Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4265623Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.4265692Z Autotune Choices Stats: 2025-12-04T10:01:22.4267288Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4267775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4268126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4268701Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4269905Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4271119Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4272302Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4273491Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4274739Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4275928Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4277134Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.4278317Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4279493Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4280706Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4280956Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.4281100Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4281172Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4281236Z unimplemented [] 
2025-12-04T10:01:22.4281345Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4281535Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4282746Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4282810Z graph_break [] 2025-12-04T10:01:22.4282949Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4283018Z Autotune Choices Stats: 2025-12-04T10:01:22.4284525Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4284833Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4285059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4285396Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4286549Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4287693Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4288872Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4290013Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4291153Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4292359Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4292645Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.4292712Z Autotune Choices Stats: 2025-12-04T10:01:22.4294184Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4294630Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4294973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4295543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4296770Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4297950Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4299143Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4300325Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4301569Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4302777Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4303958Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4305142Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4306346Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4307611Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4307863Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.4308003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4308073Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4308136Z unimplemented [] 2025-12-04T10:01:22.4308244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4308430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4309701Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4309799Z graph_break [] 2025-12-04T10:01:22.4309930Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4310006Z Autotune Choices Stats: 2025-12-04T10:01:22.4311433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4311689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4311915Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4312335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4313651Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4314861Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4316009Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4317159Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4318396Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4319529Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4319814Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.4319884Z Autotune Choices Stats: 2025-12-04T10:01:22.4321357Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4321809Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4322151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4322765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4323953Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4325157Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4326350Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4327627Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4328844Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4330025Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4331206Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4332418Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4335661Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4336911Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4337175Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.4337316Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4337395Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4337459Z unimplemented [] 2025-12-04T10:01:22.4337572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4337791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4339056Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4339162Z graph_break [] 2025-12-04T10:01:22.4339296Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4339372Z Autotune Choices Stats: 2025-12-04T10:01:22.4340810Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4341078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4341305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4341627Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4342828Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.4343954Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4345162Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4346308Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4347567Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4348747Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4349002Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.4349079Z Autotune Choices Stats: 2025-12-04T10:01:22.4350538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4351022Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4351358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.4351929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4353202Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4354390Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4355791Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4357210Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4358389Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4359571Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4360833Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4362004Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4363233Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4364413Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4364671Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.4364843Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.4364952Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4365020Z unimplemented [] 2025-12-04T10:01:22.4365125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4365323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4366528Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4366610Z graph_break [] 2025-12-04T10:01:22.4366745Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4366812Z Autotune Choices Stats: 2025-12-04T10:01:22.4368239Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.4368529Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4368776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4369095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4370253Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4371422Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4372561Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4373742Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4374908Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4376058Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4376305Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.4376381Z Autotune Choices Stats: 2025-12-04T10:01:22.4377841Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4378321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4378653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4379223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4380455Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4381805Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4383045Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4384317Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4385489Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4386671Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4387960Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4389171Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4390355Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4391569Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4391871Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.4392008Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4392085Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4392150Z unimplemented [] 2025-12-04T10:01:22.4392256Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4392452Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4393657Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4393728Z graph_break [] 2025-12-04T10:01:22.4393863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4393930Z Autotune Choices Stats: 2025-12-04T10:01:22.4395356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.4395644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4395872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4396185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4397379Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4398513Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4399678Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4400850Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4401990Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4403129Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4403374Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.4403481Z Autotune Choices Stats: 2025-12-04T10:01:22.4404939Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4405385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4405752Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4406325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4407514Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4408727Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4409935Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4411115Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4412291Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4413497Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4414710Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4415880Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4417087Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4418258Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4418576Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.4418710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4418786Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4418851Z unimplemented [] 2025-12-04T10:01:22.4418957Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4419152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4420351Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4420418Z graph_break [] 2025-12-04T10:01:22.4420586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4420652Z Autotune Choices Stats: 2025-12-04T10:01:22.4422078Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4422324Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4422548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4422902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4424060Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4425189Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4426359Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4427600Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4428737Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4429881Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4430167Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.4430244Z Autotune Choices Stats: 2025-12-04T10:01:22.4431700Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4432184Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4432521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4433083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4434306Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4435522Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4436694Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4437872Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4439082Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4440259Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4441472Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4442643Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4443852Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4445050Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4445315Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.4445446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4445522Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4445585Z unimplemented [] 2025-12-04T10:01:22.4445688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4445884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4447082Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4447184Z graph_break [] 2025-12-04T10:01:22.4447314Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4447383Z Autotune Choices Stats: 2025-12-04T10:01:22.4448804Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.4449084Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4449315Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4449632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4450779Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4451956Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4453132Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4454273Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4455579Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4456810Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4457062Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.4457130Z Autotune Choices Stats: 2025-12-04T10:01:22.4458648Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4459106Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4459440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4460047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4461234Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4462459Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4463637Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4464814Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4466020Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4467282Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4468467Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4469678Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4470880Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.4472051Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4472306Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.4472438Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4472512Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4472574Z unimplemented [] 2025-12-04T10:01:22.4472715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4472904Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4474111Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4474178Z graph_break [] 2025-12-04T10:01:22.4474304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4474370Z Autotune Choices Stats: 2025-12-04T10:01:22.4475827Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4476077Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4476302Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4476615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4477808Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4478971Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4480107Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4481243Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4482414Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4483553Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4483869Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.4483938Z Autotune Choices Stats: 2025-12-04T10:01:22.4485401Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4485851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4486229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4486826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4488008Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4489193Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4490372Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4491588Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.4492788Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4493969Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4495183Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4496384Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4497561Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4498730Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4499029Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.4499160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4499236Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4499298Z unimplemented [] 2025-12-04T10:01:22.4499402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4499594Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4500792Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4500861Z graph_break [] 2025-12-04T10:01:22.4501024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4501100Z Autotune Choices Stats: 2025-12-04T10:01:22.4502522Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.4502769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4503000Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4503348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4504537Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4505672Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4506810Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4508016Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4509209Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4510378Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4510626Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.4510697Z Autotune Choices Stats: 2025-12-04T10:01:22.4512189Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4512630Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.4513002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4513559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4514762Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4515946Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4517173Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4518357Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4519562Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4520736Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4521996Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4523196Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4524365Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4525538Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.4525827Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.4525956Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4526032Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4526094Z unimplemented [] 2025-12-04T10:01:22.4526195Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4526386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4527613Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4527683Z graph_break [] 2025-12-04T10:01:22.4527811Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4527878Z Autotune Choices Stats: 2025-12-04T10:01:22.4529340Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4529588Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4529920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4530234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4531393Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4532525Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4533666Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4534829Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4535997Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4537133Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4537382Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.4537449Z Autotune Choices Stats: 2025-12-04T10:01:22.4538940Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4539413Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4539749Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4540312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4541813Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4543086Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4544269Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4545489Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4546658Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.4547925Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4549147Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4550321Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4551491Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4552691Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4552950Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.4553095Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4553179Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4553243Z unimplemented [] 2025-12-04T10:01:22.4553353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4553591Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4554798Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4554865Z graph_break [] 2025-12-04T10:01:22.4554999Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.4555069Z Autotune Choices Stats: 2025-12-04T10:01:22.4556718Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4557020Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4557249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4557568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4558732Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4559859Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4561055Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4562186Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4563387Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4564528Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4564810Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.4564910Z Autotune Choices Stats: 2025-12-04T10:01:22.4566375Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4566815Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4567160Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4567726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.4568920Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4570134Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4571352Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4572532Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4573733Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4574941Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4576118Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4577296Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4578503Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4579668Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4579956Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.4580093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4580167Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4580236Z unimplemented [] 2025-12-04T10:01:22.4580349Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4580550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4581753Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4581820Z graph_break [] 2025-12-04T10:01:22.4582000Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4582101Z Autotune Choices Stats: 2025-12-04T10:01:22.4583522Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4583767Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4583994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4584312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4585467Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4586631Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4587801Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4588970Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.4590117Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4591283Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4591560Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.4591626Z Autotune Choices Stats: 2025-12-04T10:01:22.4593086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4593522Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4593859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4594416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4595648Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4596861Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4598039Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4599247Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4600419Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4601629Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4602802Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4603984Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4605190Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4606385Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4606638Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.4606767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4606836Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4606903Z unimplemented [] 2025-12-04T10:01:22.4607005Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4607196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4608435Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4608533Z graph_break [] 2025-12-04T10:01:22.4608666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4608733Z Autotune Choices Stats: 2025-12-04T10:01:22.4610143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4610389Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4610612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4610926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4612072Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4613235Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4614421Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4615565Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4616730Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4617887Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4618139Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.4618207Z Autotune Choices Stats: 
2025-12-04T10:01:22.4619659Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4620096Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4620472Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4621031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4622219Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.4623443Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4624627Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4625846Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4627046Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4628323Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4629496Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4630719Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4631937Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4633105Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4633358Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.4633490Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4633561Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4633632Z unimplemented [] 2025-12-04T10:01:22.4633768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4633961Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4635190Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4635253Z graph_break [] 2025-12-04T10:01:22.4635389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4635457Z Autotune Choices Stats: 2025-12-04T10:01:22.4636883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4637126Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4637352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4637702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4638850Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4640016Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4641156Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4642513Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4643650Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4644816Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4645071Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.4645140Z Autotune Choices Stats: 2025-12-04T10:01:22.4646591Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4647084Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4647427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4647983Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4649207Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4650386Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4651601Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4652806Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4653996Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4655171Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4656599Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.4657883Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4659057Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4660277Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4660544Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.4660768Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.4660846Z Traceback (most recent call last): 2025-12-04T10:01:22.4661152Z File 
"/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.4661220Z self.assertTrue( 2025-12-04T10:01:22.4661421Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.4661501Z raise self.failureException(msg) 2025-12-04T10:01:22.4661747Z AssertionError: False is not true : Log file /tmp/tmp5bnad5h5/flex_attention_configs.json was not created 2025-12-04T10:01:22.4661753Z 2025-12-04T10:01:22.4661900Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.4662165Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.4662171Z 2025-12-04T10:01:22.4662335Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.4662474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4662544Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4662613Z unimplemented [] 2025-12-04T10:01:22.4662717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4663925Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.4664178Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4664241Z graph_break [] 2025-12-04T10:01:22.4664376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4665374Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.4665459Z current_size = base.storage().size() 2025-12-04T10:01:22.4665529Z Autotune Choices Stats: 2025-12-04T10:01:22.4666999Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.4667333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4667558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4667882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4669060Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4670222Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4671355Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4672482Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4673650Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4674812Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4675076Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.4675144Z Autotune Choices Stats: 2025-12-04T10:01:22.4676592Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.4677073Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4677414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4678025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4679216Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4680389Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4681559Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4682761Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4683971Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4685207Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4686635Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4687838Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4689003Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4690191Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4690483Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 
2025-12-04T10:01:22.4690621Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4690692Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4690756Z unimplemented [] 2025-12-04T10:01:22.4690867Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4691060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4692296Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4692360Z graph_break [] 2025-12-04T10:01:22.4692490Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4692567Z Autotune Choices Stats: 2025-12-04T10:01:22.4693993Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4694246Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4694502Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4694870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4696238Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4697520Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4698651Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4699822Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4700953Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4702115Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4702377Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.4702445Z Autotune Choices Stats: 2025-12-04T10:01:22.4703933Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4704412Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4704742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4705411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4706777Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4708004Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4709221Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4710416Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4711585Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4712799Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4714000Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4715171Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4716343Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:22.4717554Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4717805Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.4717939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4718011Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4718072Z unimplemented [] 2025-12-04T10:01:22.4718179Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4718363Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4719607Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4719672Z graph_break [] 2025-12-04T10:01:22.4719800Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4719873Z Autotune Choices Stats: 2025-12-04T10:01:22.4721324Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4721610Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4721828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4722148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4723292Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4724424Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4725553Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4726730Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4727886Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4729019Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4729269Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.4729335Z Autotune Choices Stats: 2025-12-04T10:01:22.4730817Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4731298Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4731632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4732199Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4733376Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4734585Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4735771Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4736982Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.4738156Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4739360Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4740567Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4741737Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4742917Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4744122Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4744369Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.4744505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4744573Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4744687Z unimplemented [] 2025-12-04T10:01:22.4744806Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4745022Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4746444Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4746516Z graph_break [] 2025-12-04T10:01:22.4746668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4746754Z Autotune Choices Stats: 2025-12-04T10:01:22.4748371Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4748654Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4748874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4749197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4754155Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4755678Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4756939Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4758370Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4759511Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4760695Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4760959Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.4761085Z Autotune Choices Stats: 2025-12-04T10:01:22.4762541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4763004Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4763347Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4763911Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4765098Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4766307Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4767523Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4768701Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4769909Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4771108Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4772312Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4773529Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4774758Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4776183Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4776501Z SingleProcess AUTOTUNE 
benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.4776679Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4776761Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4776828Z unimplemented [] 2025-12-04T10:01:22.4776945Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4777144Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4778423Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4778489Z graph_break [] 2025-12-04T10:01:22.4778674Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4778750Z Autotune Choices Stats: 2025-12-04T10:01:22.4780198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4780464Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4780694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4781030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4782206Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4783433Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4784601Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4785817Z triton_flex_attention_79 
0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4786975Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4788250Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4788546Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.4788623Z Autotune Choices Stats: 2025-12-04T10:01:22.4790119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4790577Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4790913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4791544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4792750Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4793988Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4795276Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4796748Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4798081Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4799287Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4800494Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4801771Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4802984Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4804214Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4804477Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.4804620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4804695Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4804762Z unimplemented [] 2025-12-04T10:01:22.4804892Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4805121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4806630Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4806746Z graph_break [] 2025-12-04T10:01:22.4806879Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4806955Z Autotune Choices Stats: 2025-12-04T10:01:22.4808408Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": 
"triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4808667Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4808888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4809224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4810436Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4811612Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4812811Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4813984Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4815248Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4816692Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4816997Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.4817092Z Autotune Choices Stats: 2025-12-04T10:01:22.4818637Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4819095Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4819470Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4820060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4821285Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4822540Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4823747Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4825038Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4826520Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4827806Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4829017Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4830267Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4831510Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4832714Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4832973Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.4833119Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4833226Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4833295Z unimplemented [] 2025-12-04T10:01:22.4833410Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4833641Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4834874Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4834945Z graph_break [] 2025-12-04T10:01:22.4835075Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4835154Z Autotune Choices Stats: 2025-12-04T10:01:22.4836620Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4836878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4837137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4837465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4838645Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4839840Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4841001Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4842203Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4843399Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4844566Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4844825Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.4844898Z Autotune Choices Stats: 2025-12-04T10:01:22.4846392Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.4846889Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.4847224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4847807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4849067Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4850276Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4851534Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4852774Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4853986Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4855351Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4856662Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4857935Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4859287Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4860558Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.4860868Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.4861006Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4861085Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4861150Z unimplemented [] 2025-12-04T10:01:22.4861264Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4861459Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4862695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4862779Z graph_break [] 2025-12-04T10:01:22.4862918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4862997Z Autotune Choices Stats: 2025-12-04T10:01:22.4864469Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.4864772Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4864999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4865335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4866559Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4867810Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4869059Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4870261Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4871454Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4872616Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4872879Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.4872956Z Autotune Choices Stats: 2025-12-04T10:01:22.4874491Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4874949Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4875291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4875913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4877139Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4878381Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4879614Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4880819Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4882027Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.4883263Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4884465Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4885728Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4886938Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4888183Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4888476Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.4888614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4888705Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4888777Z unimplemented [] 2025-12-04T10:01:22.4888896Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4889091Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4890332Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4890407Z graph_break [] 2025-12-04T10:01:22.4890544Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.4890659Z Autotune Choices Stats: 2025-12-04T10:01:22.4892104Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.4892363Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4892586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4892918Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4894137Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4895306Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4896509Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4897707Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4898868Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4900060Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4900358Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.4900435Z Autotune Choices Stats: 2025-12-04T10:01:22.4901933Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4902418Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4902758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4903347Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.4904599Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4905808Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4907039Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4908350Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4909559Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4910795Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4912049Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4913260Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4914509Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4915758Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4916020Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.4916158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4916240Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4916306Z unimplemented [] 2025-12-04T10:01:22.4916415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4916612Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4917845Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4917970Z graph_break [] 2025-12-04T10:01:22.4918105Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4918180Z Autotune Choices Stats: 2025-12-04T10:01:22.4919644Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4919904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4920167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4920495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4921688Z triton_flex_attention_175 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4922880Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4924085Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4925267Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:22.4926429Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4927635Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4927894Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.4927974Z Autotune Choices Stats: 2025-12-04T10:01:22.4929489Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4929951Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4930288Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4930872Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4932127Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4933376Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4934590Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4935806Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4937038Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4938271Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4939475Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4940713Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4942000Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4943222Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4943482Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.4943615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4943693Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4943769Z unimplemented [] 2025-12-04T10:01:22.4943877Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4944077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4945356Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4945431Z graph_break [] 2025-12-04T10:01:22.4945562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4945637Z Autotune Choices Stats: 2025-12-04T10:01:22.4947128Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.4947442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4947664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4947992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4949205Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4950413Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4951587Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4952760Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4953925Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4955148Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4955576Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.4955654Z Autotune Choices Stats: 
2025-12-04T10:01:22.4957201Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4957654Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4958035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4958632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4959893Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.4961111Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4962314Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4963570Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4964819Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4966024Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4967233Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4968467Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4969706Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4970917Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4971179Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.4971349Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4971431Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4971497Z unimplemented [] 2025-12-04T10:01:22.4971605Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4971813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.4973050Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.4973120Z graph_break [] 2025-12-04T10:01:22.4973256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.4973330Z Autotune Choices Stats: 2025-12-04T10:01:22.4974815Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.4975070Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4975289Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4975623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4976845Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4978048Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4979225Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.4980404Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.4981604Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.4982820Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4983081Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.4983157Z Autotune Choices Stats: 2025-12-04T10:01:22.4984651Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.4985161Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.4985530Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.4986111Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.4987372Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4988589Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4989827Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.4991033Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4992272Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4993479Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.4994721Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.4996220Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.4997437Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.4998647Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.4998950Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.4999085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.4999162Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.4999226Z 
unimplemented [] 2025-12-04T10:01:22.4999332Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.4999532Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5000810Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5000883Z graph_break [] 2025-12-04T10:01:22.5001015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5001086Z Autotune Choices Stats: 2025-12-04T10:01:22.5002550Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.5002836Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5003063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5003424Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5004602Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5005771Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5006942Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5008165Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5009363Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5010530Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5010788Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.5010865Z Autotune Choices Stats: 2025-12-04T10:01:22.5012388Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5012868Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5013203Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5013794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5015002Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5016217Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.5017458Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5018716Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5019921Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5021165Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5022407Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5023608Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5024817Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5026059Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5026325Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.5026458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5026534Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5026600Z unimplemented [] 2025-12-04T10:01:22.5026710Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5026907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5028265Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5028341Z graph_break [] 2025-12-04T10:01:22.5028470Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5028540Z Autotune Choices Stats: 2025-12-04T10:01:22.5030049Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5030331Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5030560Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5030891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5032072Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5033236Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5034451Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5035630Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5036829Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5037995Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5038250Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.5038325Z Autotune Choices Stats: 2025-12-04T10:01:22.5039840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5040322Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5040657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5041240Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5042456Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5043701Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5044963Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5046408Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5047752Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5048994Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5050198Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5051395Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.5052654Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5053853Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5054116Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.5054289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5054367Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5054434Z unimplemented [] 2025-12-04T10:01:22.5054543Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5054740Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5056404Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5056488Z graph_break [] 2025-12-04T10:01:22.5056643Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5056790Z Autotune Choices Stats: 2025-12-04T10:01:22.5058256Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5058556Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5058789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5059121Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5060307Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5061473Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5062696Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5063907Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5065082Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5066278Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5066564Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.5066643Z Autotune Choices Stats: 2025-12-04T10:01:22.5068193Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5068643Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5068977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5069556Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5070820Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5072037Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5073280Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5074495Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5075735Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5076977Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5078188Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5079400Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5080645Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5081871Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5082135Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.5082272Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.5082346Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5082411Z unimplemented [] 2025-12-04T10:01:22.5082518Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5082722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5084039Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5084160Z graph_break [] 2025-12-04T10:01:22.5084294Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5084373Z Autotune Choices Stats: 2025-12-04T10:01:22.5085839Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5086088Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5086323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5086650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5087835Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5089039Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5090234Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5091406Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5092608Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5093810Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5094063Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.5094138Z Autotune Choices Stats: 2025-12-04T10:01:22.5095631Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5096080Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5096414Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5097034Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5098266Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5099511Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5100713Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5101951Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5103179Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5104380Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5105588Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5106825Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5108119Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5109335Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5109599Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.5109730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5109805Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5109871Z unimplemented [] 2025-12-04T10:01:22.5109978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5110206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5111436Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5111537Z graph_break [] 2025-12-04T10:01:22.5111665Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5111736Z Autotune Choices Stats: 2025-12-04T10:01:22.5113197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5113459Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5113687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5114047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5115232Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5116399Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.5117606Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5118775Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5119981Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5121184Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5121444Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.5121515Z Autotune Choices Stats: 2025-12-04T10:01:22.5123012Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5123498Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5123838Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5124419Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5125679Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5126888Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5128120Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5129363Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5130574Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5131786Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5133021Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5134223Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5135460Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5136662Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5136958Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.5137092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5137201Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5137265Z unimplemented [] 2025-12-04T10:01:22.5137371Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5137565Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5138805Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5138874Z graph_break [] 2025-12-04T10:01:22.5139007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5139078Z Autotune Choices Stats: 2025-12-04T10:01:22.5140536Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5140821Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5141049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5141377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5142559Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5143763Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5144935Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5146125Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5147381Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5148549Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5148802Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.5148873Z Autotune Choices Stats: 2025-12-04T10:01:22.5150372Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5150860Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5151206Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5151840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5153064Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5154280Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5155869Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5157144Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5158354Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5159579Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5160829Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5162077Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5163287Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5164521Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5164819Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.5164952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5165030Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5165098Z unimplemented [] 2025-12-04T10:01:22.5165207Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5165404Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5166647Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5166715Z graph_break [] 2025-12-04T10:01:22.5166845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5166914Z Autotune Choices Stats: 2025-12-04T10:01:22.5168393Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5168685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5168918Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5169246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5170461Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5171620Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5172837Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5174051Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5175223Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5176398Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5176689Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.5176759Z Autotune Choices Stats: 2025-12-04T10:01:22.5178262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5178711Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5179085Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5179671Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5180909Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5182155Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5183391Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5184604Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5185820Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5187078Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5188405Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5189618Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5190861Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5192092Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5192357Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.5192540Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.5192623Z Traceback (most recent call last): 2025-12-04T10:01:22.5192933Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.5193002Z self.assertTrue( 2025-12-04T10:01:22.5193208Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.5193292Z raise self.failureException(msg) 2025-12-04T10:01:22.5193539Z AssertionError: False is not true : Log file /tmp/tmpnnpp2jxf/flex_attention_configs.json was not created 2025-12-04T10:01:22.5193543Z 2025-12-04T10:01:22.5193685Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.5193942Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.5193991Z 2025-12-04T10:01:22.5194164Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.5194298Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.5194369Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5194441Z unimplemented [] 2025-12-04T10:01:22.5194548Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5195759Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.5195948Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5196011Z graph_break [] 2025-12-04T10:01:22.5196148Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5197207Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.5197300Z current_size = base.storage().size() 2025-12-04T10:01:22.5197370Z Autotune Choices Stats: 2025-12-04T10:01:22.5198836Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.5199124Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5199352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5199680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5200837Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5201972Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5203117Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5204286Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5205464Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5206603Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5206858Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.5206928Z Autotune Choices Stats: 2025-12-04T10:01:22.5208426Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.5208896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5209238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5209814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5211017Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5212237Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5213412Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5214621Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5215803Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5217018Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5218240Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5219431Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5220601Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5221817Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5222072Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.5222211Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5222283Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5222354Z unimplemented [] 2025-12-04T10:01:22.5222538Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5222734Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5223940Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5224004Z graph_break [] 2025-12-04T10:01:22.5224136Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5224206Z Autotune Choices Stats: 2025-12-04T10:01:22.5225668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5225954Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5226177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5226504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5227718Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5228861Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5230043Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5231219Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5232357Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5233501Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5233787Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.5233890Z Autotune Choices Stats: 2025-12-04T10:01:22.5235358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5235801Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5236136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5236704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5237898Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5239116Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5240320Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5241492Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5242705Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5243916Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5245095Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5246271Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5247479Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5248654Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5248939Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.5249077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5249148Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5249219Z unimplemented [] 2025-12-04T10:01:22.5249335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5249526Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5250727Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5250789Z graph_break [] 2025-12-04T10:01:22.5250975Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5251076Z Autotune Choices Stats: 2025-12-04T10:01:22.5252491Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5252746Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5252967Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5253292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5254443Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5258261Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5259962Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5261858Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5263698Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5265698Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5266211Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.5266327Z Autotune Choices Stats: 2025-12-04T10:01:22.5268759Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5269432Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5269980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5270754Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5272081Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5273307Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5274500Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5275708Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5276896Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5278121Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5279297Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5280479Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5281694Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5283341Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5285018Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.5285505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5285795Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5286048Z unimplemented [] 2025-12-04T10:01:22.5286346Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5286819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5288714Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5290245Z graph_break [] 2025-12-04T10:01:22.5290489Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5290773Z Autotune Choices Stats: 2025-12-04T10:01:22.5292337Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5294079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5294630Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5295273Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5296830Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5299374Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5302036Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5304404Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5306810Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.5309335Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5310957Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.5311414Z Autotune Choices Stats: 2025-12-04T10:01:22.5313073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5315060Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5315973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5316959Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5318800Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5321272Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5323707Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5326178Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5328628Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5331045Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5333460Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.5335913Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5338363Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5340783Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5342276Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.5342759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5343053Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5343249Z unimplemented [] 
2025-12-04T10:01:22.5343498Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5343873Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5345476Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5347037Z graph_break [] 2025-12-04T10:01:22.5347346Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5347634Z Autotune Choices Stats: 2025-12-04T10:01:22.5349173Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5350908Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5351455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5352129Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5353672Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5356245Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5358604Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5361009Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5363351Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5365748Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5367221Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.5367619Z Autotune Choices Stats: 2025-12-04T10:01:22.5369192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5371239Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5372088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5373065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5374978Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5377699Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5380161Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5382620Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5385097Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5387924Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5390394Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5392811Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5395342Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5397996Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5399523Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.5400021Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5400312Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5400511Z unimplemented [] 2025-12-04T10:01:22.5400718Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5401080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5402543Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5403862Z graph_break [] 2025-12-04T10:01:22.5404091Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5404366Z Autotune Choices Stats: 2025-12-04T10:01:22.5405895Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5407659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5408206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5408822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5410369Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5412766Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5415113Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5417499Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5419857Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5422194Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5423650Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.5424039Z Autotune Choices Stats: 2025-12-04T10:01:22.5425615Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5427667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5428516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5429548Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5431380Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5433860Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5436628Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5439178Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5441602Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5444082Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5446816Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5449277Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5451697Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5454164Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5455880Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.5456345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5456625Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5456825Z unimplemented [] 2025-12-04T10:01:22.5457035Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5457405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5458879Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5460202Z graph_break [] 2025-12-04T10:01:22.5460433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5460714Z Autotune Choices Stats: 2025-12-04T10:01:22.5462241Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5464105Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5464649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5465283Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5466899Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.5469345Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5471737Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5474120Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5476485Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5478827Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5480321Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.5480715Z Autotune Choices Stats: 2025-12-04T10:01:22.5482285Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5484253Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5485202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.5486373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5488320Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5490785Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5493259Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5495690Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5498118Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5500603Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5503074Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5505496Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5508051Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5510511Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5512014Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.5512485Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.5512770Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5512958Z unimplemented [] 2025-12-04T10:01:22.5513170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5513543Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5515036Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5516662Z graph_break [] 2025-12-04T10:01:22.5516934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5517220Z Autotune Choices Stats: 2025-12-04T10:01:22.5518754Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.5520479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5521063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5521685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5523232Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5525702Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5528096Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5530461Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5532815Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5535176Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5536669Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.5537080Z Autotune Choices Stats: 2025-12-04T10:01:22.5538689Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5540667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5541521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5542493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5544362Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5546836Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5549326Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5551761Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5554230Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5556895Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5559329Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5561801Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5564229Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5566720Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5568223Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.5568683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5568973Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5569173Z unimplemented [] 2025-12-04T10:01:22.5569376Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5569752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5571223Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5572611Z graph_break [] 2025-12-04T10:01:22.5572844Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5573139Z Autotune Choices Stats: 2025-12-04T10:01:22.5574741Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.5576477Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5577025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5577642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5579195Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5581595Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5583970Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5586334Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5588760Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5591151Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5592615Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.5593013Z Autotune Choices Stats: 2025-12-04T10:01:22.5594624Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5596609Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5597458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5598474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5600346Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5602780Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5605204Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5607672Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5610096Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5612559Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5614986Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5617751Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5620213Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5622643Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5624147Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.5624629Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5624957Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5625156Z unimplemented [] 2025-12-04T10:01:22.5625363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5625738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5627258Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5628592Z graph_break [] 2025-12-04T10:01:22.5628824Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5629115Z Autotune Choices Stats: 2025-12-04T10:01:22.5630702Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5632440Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5632986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5633609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5635206Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5637586Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5639950Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5642306Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5644707Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5647096Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5648562Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.5648958Z Autotune Choices Stats: 2025-12-04T10:01:22.5650529Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5652496Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5653394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5654575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5656600Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5659038Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5661605Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5664095Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5666569Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5669082Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5671570Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5674035Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5676457Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5678893Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5680428Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.5680894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5681187Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5681378Z unimplemented [] 2025-12-04T10:01:22.5681589Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5681962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5683470Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5684803Z graph_break [] 2025-12-04T10:01:22.5685031Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5685320Z Autotune Choices Stats: 2025-12-04T10:01:22.5686855Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.5688593Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5689185Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5689839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5691395Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5693755Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5696120Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5698520Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5700875Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5703289Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5704754Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.5705148Z Autotune Choices Stats: 2025-12-04T10:01:22.5706780Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5708877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5709729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5710699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5712533Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5714966Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5717441Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5719902Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5722328Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5724796Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5727267Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5729691Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5732109Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.5734574Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5736081Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.5736547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5736845Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5737048Z unimplemented [] 2025-12-04T10:01:22.5737251Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5737625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5739134Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5740465Z graph_break [] 2025-12-04T10:01:22.5740697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5740977Z Autotune Choices Stats: 2025-12-04T10:01:22.5742557Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5744386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5744939Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5745564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5747115Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5749514Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5751870Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5754268Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5757214Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5759582Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5761047Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.5761436Z Autotune Choices Stats: 2025-12-04T10:01:22.5768650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5770722Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5771585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5772566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5774416Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5776921Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5779346Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5781806Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.5784231Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5786711Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5789256Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5791685Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5794103Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5796555Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5798060Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.5798542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5798881Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5799089Z unimplemented [] 2025-12-04T10:01:22.5799304Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5799675Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5801147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5802467Z graph_break [] 2025-12-04T10:01:22.5802698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5802984Z Autotune Choices Stats: 2025-12-04T10:01:22.5804562Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.5806329Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5806881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5807506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5809061Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5811407Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5813795Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5816174Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5818523Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5820916Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5822427Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.5822821Z Autotune Choices Stats: 2025-12-04T10:01:22.5824390Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5826359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.5827268Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5828251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5830097Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5832576Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5835045Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5837471Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5839926Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5842372Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5844791Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5847210Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5849666Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5852142Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.5853661Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.5854142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5854440Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5854635Z unimplemented [] 2025-12-04T10:01:22.5854851Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5855422Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5856971Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5858339Z graph_break [] 2025-12-04T10:01:22.5858563Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5858845Z Autotune Choices Stats: 2025-12-04T10:01:22.5860375Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5862101Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5862648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5863265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5864810Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5867276Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5869618Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5872018Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5874370Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5876748Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5878245Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.5878640Z Autotune Choices Stats: 2025-12-04T10:01:22.5880206Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5882169Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5883017Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5884042Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5885896Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5888369Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5890809Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5893263Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5895736Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.5898157Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5900578Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5903029Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5905465Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5908022Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5909528Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.5909988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5910281Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5910493Z unimplemented [] 2025-12-04T10:01:22.5910699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5911075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5912580Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5913932Z graph_break [] 2025-12-04T10:01:22.5914166Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.5914453Z Autotune Choices Stats: 2025-12-04T10:01:22.5915998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5917722Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5918272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5918889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5920492Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5922847Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5925253Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5927619Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.5930002Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5932391Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5933845Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.5934233Z Autotune Choices Stats: 2025-12-04T10:01:22.5935801Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.5937760Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5938652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5939628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.5941511Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5943948Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5946402Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5948899Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5951366Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5953787Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5956377Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.5958878Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.5961355Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.5963792Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.5965283Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.5965749Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.5966080Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.5966325Z unimplemented [] 2025-12-04T10:01:22.5966535Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.5966909Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.5968372Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.5969701Z graph_break [] 2025-12-04T10:01:22.5969938Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.5970224Z Autotune Choices Stats: 2025-12-04T10:01:22.5971757Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.5973488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5974084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5974711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5976261Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5978652Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5980992Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5983377Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.5985785Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.5988245Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5989712Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.5990104Z Autotune Choices Stats: 2025-12-04T10:01:22.5991668Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.5993673Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.5994523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.5995498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.5997372Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.5999818Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6002278Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6004735Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6007144Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6009565Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6012021Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6014471Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6016885Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6019330Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6020849Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.6021309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6021608Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6021800Z unimplemented [] 2025-12-04T10:01:22.6022007Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6022373Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6023838Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6025163Z graph_break [] 2025-12-04T10:01:22.6025384Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6025664Z Autotune Choices Stats: 2025-12-04T10:01:22.6027256Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6029030Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6029577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6030188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6031776Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6034116Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6036546Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6038891Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6041287Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6043628Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6045099Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.6045528Z Autotune Choices Stats: 
2025-12-04T10:01:22.6047093Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6049057Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6049957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6050945Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6052782Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.6055472Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6057969Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6060383Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6062806Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6065271Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6067790Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6068965Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6070140Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6071349Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6071636Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.6071775Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6071859Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6071923Z unimplemented [] 2025-12-04T10:01:22.6072030Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6072225Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6073422Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6073489Z graph_break [] 2025-12-04T10:01:22.6073623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6073732Z Autotune Choices Stats: 2025-12-04T10:01:22.6075199Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6075500Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6075766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6076153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6077514Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6078649Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6079819Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6080995Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6082131Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6083263Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6083548Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.6083622Z Autotune Choices Stats: 2025-12-04T10:01:22.6085119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6085698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6086096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6086771Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6087991Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6089216Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6090403Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6091587Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6092798Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6093964Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6095193Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.6096364Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6097578Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6098781Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6099040Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.6099171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6099248Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6099311Z 
unimplemented [] 2025-12-04T10:01:22.6099417Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6099610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6100806Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6100907Z graph_break [] 2025-12-04T10:01:22.6101039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6101112Z Autotune Choices Stats: 2025-12-04T10:01:22.6102533Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6102818Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6103047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6103368Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6104515Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6105681Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6106895Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6108093Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6109233Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6110415Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6110670Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.6110744Z Autotune Choices Stats: 2025-12-04T10:01:22.6112239Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6112692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6113028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6113604Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6114826Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6116044Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.6117221Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6118406Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6119604Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6120811Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6121994Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6123196Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6124398Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6125566Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6125822Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.6125954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6126042Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6126109Z unimplemented [] 2025-12-04T10:01:22.6126215Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6126466Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6127667Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6127737Z graph_break [] 2025-12-04T10:01:22.6127869Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6127939Z Autotune Choices Stats: 2025-12-04T10:01:22.6129395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6129646Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6129871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6130188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6131368Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6132529Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6133665Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6134799Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6135960Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6137101Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6137356Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.6137428Z Autotune Choices Stats: 2025-12-04T10:01:22.6138919Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6139364Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6139729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6140334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6141516Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6142694Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6143872Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6145094Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6146297Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6147523Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6148730Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6149927Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.6151113Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6152282Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6152540Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.6152751Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.6152839Z Traceback (most recent call last): 2025-12-04T10:01:22.6153147Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.6153216Z self.assertTrue( 2025-12-04T10:01:22.6153426Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.6153512Z raise self.failureException(msg) 2025-12-04T10:01:22.6153765Z AssertionError: False is not true : Log file /tmp/tmp9e0yqvfi/flex_attention_configs.json was not created 2025-12-04T10:01:22.6153771Z 2025-12-04T10:01:22.6153908Z To execute this test, run the following 
from the base repo dir: 2025-12-04T10:01:22.6154170Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.6154176Z 2025-12-04T10:01:22.6154345Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.6154516Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6154596Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6154662Z unimplemented [] 2025-12-04T10:01:22.6154770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6156126Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.6156317Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6156391Z graph_break [] 2025-12-04T10:01:22.6156524Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6157593Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.6157759Z current_size = base.storage().size() 2025-12-04T10:01:22.6157830Z Autotune Choices Stats: 2025-12-04T10:01:22.6159265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.6159519Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6159751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6160069Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6161230Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6162410Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6163590Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6164720Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6165897Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6167057Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6167312Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.6167382Z Autotune Choices Stats: 2025-12-04T10:01:22.6168842Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.6169284Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6169627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6170242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6171427Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6172648Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6173823Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6175031Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6176250Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6177429Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6178596Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6179801Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6181010Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6182179Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6182434Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.6182568Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6182642Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6182715Z unimplemented [] 2025-12-04T10:01:22.6182820Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6183045Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6184301Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6184363Z graph_break [] 2025-12-04T10:01:22.6184502Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6184571Z Autotune Choices Stats: 2025-12-04T10:01:22.6185997Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6186244Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6186471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6186826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6188012Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6189234Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6190378Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6191512Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6192693Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6193860Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6194117Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.6194185Z Autotune Choices Stats: 2025-12-04T10:01:22.6195646Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6196122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6196462Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6197037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6198254Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6199428Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6200647Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6201856Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6203030Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6204207Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6205411Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6206595Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6207795Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6208969Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6209256Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.6209391Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6209501Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6209570Z unimplemented [] 2025-12-04T10:01:22.6209674Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6209860Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6211056Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6211120Z graph_break [] 2025-12-04T10:01:22.6211256Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6211326Z Autotune Choices Stats: 2025-12-04T10:01:22.6212739Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6213024Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6213252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6213572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6214720Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6215906Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6217047Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6218209Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6219387Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6220524Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6220780Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.6220848Z Autotune Choices Stats: 2025-12-04T10:01:22.6222308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6222781Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6223125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6223744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6224932Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6226112Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6227368Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6228588Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6229771Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6230942Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6232136Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6233347Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6234516Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6235731Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6236015Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.6236147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6236217Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6236288Z unimplemented [] 2025-12-04T10:01:22.6236389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6236581Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6237793Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6237857Z graph_break [] 2025-12-04T10:01:22.6237990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6238061Z Autotune Choices Stats: 2025-12-04T10:01:22.6239480Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6239767Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6239999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6240323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6241501Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6242640Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6243820Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6244987Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6246127Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6247252Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6247540Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.6247607Z Autotune Choices Stats: 2025-12-04T10:01:22.6249063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6249499Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6249872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6250444Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6251628Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6252840Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6254050Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6255383Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6256573Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6257831Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6259047Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.6260222Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6261431Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6262667Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6262922Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.6263056Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6263129Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6263198Z unimplemented [] 
2025-12-04T10:01:22.6263301Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6263490Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6264686Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6264785Z graph_break [] 2025-12-04T10:01:22.6264929Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6265008Z Autotune Choices Stats: 2025-12-04T10:01:22.6266698Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6266991Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6267306Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6267684Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6268829Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6270002Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6271145Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6272307Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6273441Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6274573Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6274863Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.6274932Z Autotune Choices Stats: 2025-12-04T10:01:22.6276399Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6276874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6277213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6277783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6279008Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6280219Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6281405Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6282576Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6283787Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6284964Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6286172Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6287348Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6288548Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6289773Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6290023Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.6290160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6290229Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6290298Z unimplemented [] 2025-12-04T10:01:22.6290400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6290583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6291785Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6291883Z graph_break [] 2025-12-04T10:01:22.6292023Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6292090Z Autotune Choices Stats: 2025-12-04T10:01:22.6293507Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6293795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6294019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6294347Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6295490Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6296659Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6297827Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6298973Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6300108Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6301284Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6301537Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.6301604Z Autotune Choices Stats: 2025-12-04T10:01:22.6303105Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6303547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6303887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6304490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6305680Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6306891Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6308107Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6309276Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6310487Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6311700Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6312871Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6314079Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6315293Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6316467Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6316712Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.6316857Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6316930Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6316999Z unimplemented [] 2025-12-04T10:01:22.6317139Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6317323Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6318526Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6318590Z graph_break [] 2025-12-04T10:01:22.6318722Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6318788Z Autotune Choices Stats: 2025-12-04T10:01:22.6320239Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6320492Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6320712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6321034Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6322232Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.6323597Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6324732Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6325880Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6327097Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6328239Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6328525Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.6328592Z Autotune Choices Stats: 2025-12-04T10:01:22.6330059Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6330509Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6330887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.6331488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6332684Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6333864Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6335049Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6336392Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6337603Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6338784Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6339984Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6341194Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6342372Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6343552Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6343834Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.6343973Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.6344045Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6344118Z unimplemented [] 2025-12-04T10:01:22.6344230Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6344418Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6345755Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6345832Z graph_break [] 2025-12-04T10:01:22.6346035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6346120Z Autotune Choices Stats: 2025-12-04T10:01:22.6347802Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.6348052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6348279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6348648Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6349831Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6350974Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6352119Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6353260Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6354435Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6355992Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6356303Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.6356385Z Autotune Choices Stats: 2025-12-04T10:01:22.6357981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6358425Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6358809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6359387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6360593Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6361770Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6363001Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6364210Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6365387Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6366572Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6367779Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6368994Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6370165Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6371353Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6371640Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.6371774Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6371845Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6371907Z unimplemented [] 2025-12-04T10:01:22.6372016Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6372204Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6373448Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6373514Z graph_break [] 2025-12-04T10:01:22.6373651Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6373719Z Autotune Choices Stats: 2025-12-04T10:01:22.6375172Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.6375441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6375699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6376024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6377176Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6378320Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6379455Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6380631Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6381808Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6382942Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6383192Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.6383261Z Autotune Choices Stats: 2025-12-04T10:01:22.6384756Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6385300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6385700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6386391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6387791Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6389032Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6390209Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6391421Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6392601Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6393818Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6395068Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6396469Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6397707Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6398916Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6399169Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.6399315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6399388Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6399451Z unimplemented [] 2025-12-04T10:01:22.6399594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6399784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6400986Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6401052Z graph_break [] 2025-12-04T10:01:22.6401188Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6401257Z Autotune Choices Stats: 2025-12-04T10:01:22.6402723Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6403010Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6403234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6403558Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6404707Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6405840Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6407009Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6408181Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6409327Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6410463Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6410746Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.6410849Z Autotune Choices Stats: 2025-12-04T10:01:22.6412309Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6412762Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6413099Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6413676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6414863Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6416084Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6417298Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6418470Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6419679Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6420898Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6422083Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6423260Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6424467Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6425650Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6425930Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.6426070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6426140Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6426204Z unimplemented [] 2025-12-04T10:01:22.6426319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6426516Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6427760Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6427824Z graph_break [] 2025-12-04T10:01:22.6427990Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6428097Z Autotune Choices Stats: 2025-12-04T10:01:22.6429532Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.6429786Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6430011Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6430335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6431486Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6432670Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6433804Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6434982Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6436126Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6437301Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6437591Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.6437657Z Autotune Choices Stats: 2025-12-04T10:01:22.6439113Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6439558Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6439890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6440464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6441683Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6442906Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6444095Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6445313Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6446490Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6447698Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6448880Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6450054Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6451256Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.6452475Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6452724Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.6452862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6452932Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6452995Z unimplemented [] 2025-12-04T10:01:22.6453101Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6453289Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6454557Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6454654Z graph_break [] 2025-12-04T10:01:22.6454788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6454862Z Autotune Choices Stats: 2025-12-04T10:01:22.6456453Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6456710Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6456931Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6457256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6458407Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6459616Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6460813Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6461959Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6463141Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6464386Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6464637Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.6464704Z Autotune Choices Stats: 2025-12-04T10:01:22.6466171Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6466615Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6466987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6467617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6468807Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6470028Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6471216Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6472424Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.6473636Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6474810Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6475994Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6477206Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6478414Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6479599Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6479846Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.6479984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6480058Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6480123Z unimplemented [] 2025-12-04T10:01:22.6480265Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6480485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6481689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6481752Z graph_break [] 2025-12-04T10:01:22.6481880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6481953Z Autotune Choices Stats: 2025-12-04T10:01:22.6483375Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.6483629Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6483851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6484217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6485372Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6486573Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6487711Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6488891Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6490061Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6491203Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6491456Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.6491525Z Autotune Choices Stats: 2025-12-04T10:01:22.6492980Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6493459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.6493809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6494387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6495614Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6496799Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6498013Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6499213Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6500397Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6501570Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6502783Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6503987Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6505160Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6506376Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.6506627Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.6506795Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6506866Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6506929Z unimplemented [] 2025-12-04T10:01:22.6507036Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6507294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6508513Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6508580Z graph_break [] 2025-12-04T10:01:22.6508712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6508784Z Autotune Choices Stats: 2025-12-04T10:01:22.6510199Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6510490Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6510710Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6511035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6512179Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6513359Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6514493Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6515666Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6516834Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6517977Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6518222Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.6518295Z Autotune Choices Stats: 2025-12-04T10:01:22.6519755Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6520252Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6520582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6521188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6522376Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6523598Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6524814Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6526001Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6527195Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.6528412Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6529586Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6530790Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6531962Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6533169Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6533453Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.6533591Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6533661Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6533726Z unimplemented [] 2025-12-04T10:01:22.6533833Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6534023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6535221Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6535285Z graph_break [] 2025-12-04T10:01:22.6535411Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.6535487Z Autotune Choices Stats: 2025-12-04T10:01:22.6536907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6537197Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6537418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6537740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6538925Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6540065Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6541249Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6542432Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6543575Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6544713Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6544996Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.6545073Z Autotune Choices Stats: 2025-12-04T10:01:22.6546538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6546983Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6547392Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6547974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.6549162Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6550386Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6551621Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6552797Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6553979Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6555347Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6556600Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6557774Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6559008Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6560234Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6560483Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.6560624Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6560700Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6560766Z unimplemented [] 2025-12-04T10:01:22.6560879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6561067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6562278Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6562393Z graph_break [] 2025-12-04T10:01:22.6562527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6562604Z Autotune Choices Stats: 2025-12-04T10:01:22.6564025Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6564280Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6564537Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6564863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6566023Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6567197Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6568362Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6569511Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.6570648Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6571823Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6572072Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.6572147Z Autotune Choices Stats: 2025-12-04T10:01:22.6573650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6574103Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6574436Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6575010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6576246Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6577459Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6578635Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6579816Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6581022Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6582233Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6583413Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6584641Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6585848Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6587027Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6587315Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.6587451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6587523Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6587587Z unimplemented [] 2025-12-04T10:01:22.6587694Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6587881Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6589122Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6589202Z graph_break [] 2025-12-04T10:01:22.6589334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6589407Z Autotune Choices Stats: 2025-12-04T10:01:22.6590862Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6591115Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6591338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6591659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6592836Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6593981Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6595151Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6596308Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6597441Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6598617Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6598861Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.6598937Z Autotune Choices Stats: 
2025-12-04T10:01:22.6600429Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6600876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6601213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6601861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6603080Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.6604265Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6605449Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6606669Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6607849Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6609056Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6610237Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6611449Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6612660Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6613836Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6614083Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.6614218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6614332Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6614395Z unimplemented [] 2025-12-04T10:01:22.6614508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6614699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6615893Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6615964Z graph_break [] 2025-12-04T10:01:22.6616092Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6616173Z Autotune Choices Stats: 2025-12-04T10:01:22.6617640Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6617897Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6618121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6618439Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6619632Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6620808Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6621955Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6623102Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6624268Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6625540Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6625847Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.6625939Z Autotune Choices Stats: 2025-12-04T10:01:22.6627706Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6628194Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6628529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6629140Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6630327Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6631509Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6632687Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6633908Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6635168Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6636568Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6637819Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.6639035Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6640219Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6641398Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6641681Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.6641809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6641884Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6641947Z 
unimplemented [] 2025-12-04T10:01:22.6642053Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6642241Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6643469Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6643541Z graph_break [] 2025-12-04T10:01:22.6643668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6643741Z Autotune Choices Stats: 2025-12-04T10:01:22.6645152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6645415Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6645671Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6646022Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6647171Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6648310Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6649454Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6650648Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6651788Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6652959Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6653208Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.6653280Z Autotune Choices Stats: 2025-12-04T10:01:22.6654759Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6655379Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6655716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6656292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6657481Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6658670Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.6659909Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6661139Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6662314Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6663541Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6664773Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6665953Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6667130Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6668385Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6668639Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.6668771Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6668853Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6668917Z unimplemented [] 2025-12-04T10:01:22.6669025Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6669216Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6670452Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6670523Z graph_break [] 2025-12-04T10:01:22.6670655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6670727Z Autotune Choices Stats: 2025-12-04T10:01:22.6672179Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6672466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6672688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6673010Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6674165Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6675317Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6676492Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6677633Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6678803Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6679947Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6680198Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.6680274Z Autotune Choices Stats: 2025-12-04T10:01:22.6681785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6682296Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6682637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6683222Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6684413Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6685637Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6686849Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6688029Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6689251Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6690424Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6691637Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6692812Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.6693999Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6695209Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6695465Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.6695593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6695700Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6695766Z unimplemented [] 2025-12-04T10:01:22.6695870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6696071Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6697261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6697330Z graph_break [] 2025-12-04T10:01:22.6697459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6697534Z Autotune Choices Stats: 2025-12-04T10:01:22.6698984Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6699276Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6699509Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6699828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6700976Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6702111Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6703277Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6704457Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6705615Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6706782Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6707062Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.6707135Z Autotune Choices Stats: 2025-12-04T10:01:22.6708627Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6709080Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6709416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6709997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6711227Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6712410Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6713625Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6714802Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6716031Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6717237Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6718420Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6719596Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6720811Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6722019Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6722274Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.6722450Z ___________ 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.6722540Z Traceback (most recent call last): 2025-12-04T10:01:22.6722838Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.6722912Z self.assertTrue( 2025-12-04T10:01:22.6723108Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.6723191Z raise self.failureException(msg) 2025-12-04T10:01:22.6723447Z AssertionError: False is not true : Log file /tmp/tmpfs0cn7zn/flex_attention_configs.json was not created 2025-12-04T10:01:22.6723452Z 2025-12-04T10:01:22.6723590Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.6723884Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.6723921Z 2025-12-04T10:01:22.6724092Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.6724225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6724303Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6724370Z unimplemented [] 2025-12-04T10:01:22.6724476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6725706Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.6725898Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:22.6725967Z graph_break [] 2025-12-04T10:01:22.6726099Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6727097Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.6727227Z current_size = base.storage().size() 2025-12-04T10:01:22.6727298Z Autotune Choices Stats: 2025-12-04T10:01:22.6728732Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.6728984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6729216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6729576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6730735Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6731870Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6733045Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6734208Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6735357Z triton_flex_attention_3 0.0154 ms 61.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6736505Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6736788Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.6736857Z Autotune Choices Stats: 2025-12-04T10:01:22.6738324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.6738846Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6739199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6739769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6741000Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6742216Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6743389Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6744567Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6745772Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6746946Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6748212Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6749387Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6750623Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6751838Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6752099Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.6752232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6752307Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6752379Z unimplemented [] 2025-12-04T10:01:22.6752482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6752673Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6753873Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6753975Z graph_break [] 2025-12-04T10:01:22.6754113Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6754188Z Autotune Choices Stats: 2025-12-04T10:01:22.6755780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:22.6756095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6756325Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6756643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6757793Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6758972Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6760152Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6761289Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6762427Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6763607Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6763862Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.6763929Z Autotune Choices Stats: 2025-12-04T10:01:22.6765422Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6765873Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6766210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6766778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6767999Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6769207Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6770381Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6771552Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6772758Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.6773964Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6775135Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6776347Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6777798Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6778977Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6779229Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.6779371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6779442Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6779513Z unimplemented [] 2025-12-04T10:01:22.6779617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6779865Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6781058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6781122Z graph_break [] 2025-12-04T10:01:22.6781259Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.6781329Z Autotune Choices Stats: 2025-12-04T10:01:22.6782786Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6783036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6783262Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6783582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6784779Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6785940Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6787081Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6788256Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6789433Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6790569Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6790824Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.6790892Z Autotune Choices Stats: 2025-12-04T10:01:22.6792387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6792834Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6793205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6793806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6794995Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6796180Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6797359Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6798567Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6803901Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6805105Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6806323Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6807495Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6808695Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6809865Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6810131Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.6810312Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6810388Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6810463Z unimplemented [] 2025-12-04T10:01:22.6810577Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6810774Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:22.6811983Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6812051Z graph_break [] 2025-12-04T10:01:22.6812197Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6812280Z Autotune Choices Stats: 2025-12-04T10:01:22.6813758Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6814014Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6814248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6814603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6815786Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6816984Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6818124Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6819252Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6820412Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6821578Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6821842Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.6821910Z Autotune Choices Stats: 2025-12-04T10:01:22.6823377Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6823853Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6824231Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6824805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6826014Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6827181Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6828476Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6829641Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6830851Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6832024Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6833226Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6834433Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6835617Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6836791Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6837079Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.6837217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6837289Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6837357Z unimplemented [] 2025-12-04T10:01:22.6837460Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6837652Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6838882Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6838950Z graph_break [] 2025-12-04T10:01:22.6839089Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6839157Z Autotune Choices Stats: 2025-12-04T10:01:22.6840571Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6840857Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6841081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6841442Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6842593Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6843735Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6844883Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6846059Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6847236Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6848367Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6848620Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.6848690Z Autotune Choices Stats: 2025-12-04T10:01:22.6850201Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6850674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6851011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6851583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6852770Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6853945Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6855155Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6856810Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6858022Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6859266Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6860484Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6861658Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6862835Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6864050Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6864306Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.6864466Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6864541Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6864615Z unimplemented [] 2025-12-04T10:01:22.6864723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6864950Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6866168Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6866234Z graph_break [] 2025-12-04T10:01:22.6866373Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.6866442Z Autotune Choices Stats: 2025-12-04T10:01:22.6867991Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6868278Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6868502Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6868830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6869981Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6871123Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6872305Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6873438Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6874607Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6875748Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6876010Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.6876078Z Autotune Choices Stats: 2025-12-04T10:01:22.6877575Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6878051Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6878396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6878967Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6880168Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6881431Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6882660Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6883832Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6885048Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6886286Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6887455Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6888632Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6889843Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6891011Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6891259Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.6891429Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6891502Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6891570Z unimplemented [] 2025-12-04T10:01:22.6891677Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6891866Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:22.6893069Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6893132Z graph_break [] 2025-12-04T10:01:22.6893267Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6893367Z Autotune Choices Stats: 2025-12-04T10:01:22.6894794Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6895081Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6895314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6895640Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6896783Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6897925Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6899085Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6900253Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6901391Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6902561Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6902843Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.6902909Z Autotune Choices Stats: 2025-12-04T10:01:22.6904373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.6904813Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6905155Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6905724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6906963Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6908174Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6909395Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6910571Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6911786Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6912992Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6914161Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6915339Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6916576Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6917786Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6918037Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.6918173Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6918242Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6918313Z unimplemented [] 2025-12-04T10:01:22.6918416Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6918602Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6921767Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6921873Z graph_break [] 2025-12-04T10:01:22.6922020Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6922098Z Autotune Choices Stats: 2025-12-04T10:01:22.6923528Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.6923808Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6924042Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6924357Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6925518Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6926649Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6927824Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6928966Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6930141Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6931370Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6931624Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.6931695Z Autotune Choices Stats: 2025-12-04T10:01:22.6933151Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6933595Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6933942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6934507Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6935698Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6936904Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6938075Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6939322Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6940519Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6941688Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6942857Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6944024Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6945227Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6946396Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6946653Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.6946787Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6946897Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6946969Z unimplemented [] 2025-12-04T10:01:22.6947077Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6947381Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6948625Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6948688Z graph_break [] 2025-12-04T10:01:22.6948828Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6948898Z Autotune Choices Stats: 2025-12-04T10:01:22.6950325Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.6950577Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6950804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6951119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6952267Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6953390Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6954559Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6955886Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6957160Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6958342Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6958600Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.6958670Z Autotune Choices Stats: 2025-12-04T10:01:22.6960121Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.6960571Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6960909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6961471Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6962702Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6963868Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6965066Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6966315Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6967480Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6968649Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6969810Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6970976Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6972179Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6973338Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.6973664Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.6973807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.6973912Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.6973982Z unimplemented [] 2025-12-04T10:01:22.6974086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.6974273Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.6975568Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.6975641Z graph_break [] 2025-12-04T10:01:22.6975808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.6975889Z Autotune Choices Stats: 2025-12-04T10:01:22.6977550Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.6977799Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6978024Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6978340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6979487Z 
triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6980667Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6981802Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.6983018Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.6984179Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6985402Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6985704Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.6985782Z Autotune Choices Stats: 2025-12-04T10:01:22.6987445Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:22.6987894Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.6988234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.6988833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.6990026Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6991232Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6992439Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6993645Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6994824Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.6996009Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.6997174Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.6998380Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.6999547Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7000792Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7001079Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.7001213Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7001283Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7001354Z unimplemented [] 2025-12-04T10:01:22.7001457Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7001651Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7002854Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7002916Z graph_break [] 2025-12-04T10:01:22.7003056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7003125Z Autotune Choices Stats: 2025-12-04T10:01:22.7004551Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.7004806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7005035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7005361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7006540Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7007669Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7008835Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7010028Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7011160Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7012299Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7012565Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.7012635Z Autotune Choices Stats: 
2025-12-04T10:01:22.7014095Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7014540Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7014944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7015508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7016696Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.7017984Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7019203Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7020376Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7021553Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7022726Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7023923Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7025106Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7026316Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7027586Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7027844Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.7027982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7028056Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7028127Z unimplemented [] 2025-12-04T10:01:22.7028233Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7028423Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7029629Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7029691Z graph_break [] 2025-12-04T10:01:22.7029829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7029897Z Autotune Choices Stats: 2025-12-04T10:01:22.7031321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7031571Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7031805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7032161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7033307Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7034469Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7035640Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7036808Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7037938Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.7039077Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7039334Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.7039402Z Autotune Choices Stats: 2025-12-04T10:01:22.7040893Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7041337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7041674Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7042228Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7043487Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7044715Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7045898Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7047061Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7048235Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7049399Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7050594Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.7051770Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7053000Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7054206Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7054459Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.7054590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7054662Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7054729Z 
unimplemented [] 2025-12-04T10:01:22.7054831Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7055017Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7056403Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7056473Z graph_break [] 2025-12-04T10:01:22.7056612Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7056680Z Autotune Choices Stats: 2025-12-04T10:01:22.7058161Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.7058413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7058641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7058958Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7060112Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7061337Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7062507Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7063644Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7064776Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7065916Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7066162Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.7066236Z Autotune Choices Stats: 2025-12-04T10:01:22.7067761Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7068209Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7068539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7069182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7070426Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7071604Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.7072778Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7073955Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7075123Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7076327Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7077497Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7078738Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7079940Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7081108Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7081362Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.7081491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7081565Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7081629Z unimplemented [] 2025-12-04T10:01:22.7081731Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7081921Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7083121Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7083187Z graph_break [] 2025-12-04T10:01:22.7083314Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7083385Z Autotune Choices Stats: 2025-12-04T10:01:22.7084841Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7085095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7085329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7085642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7086857Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7088016Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7089157Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7090293Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7091417Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7092549Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7092828Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.7092902Z Autotune Choices Stats: 2025-12-04T10:01:22.7094346Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7094823Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7095187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7095789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7096982Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7098161Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7099324Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7100498Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7101703Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7102876Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7104082Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7105328Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.7106502Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7107713Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7107965Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.7108101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7108176Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7108244Z unimplemented [] 2025-12-04T10:01:22.7108344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7108540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7109727Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7109799Z graph_break [] 2025-12-04T10:01:22.7109965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7110036Z Autotune Choices Stats: 2025-12-04T10:01:22.7111446Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7111744Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7111974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7112321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7113496Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7114624Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7115757Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7116896Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7118019Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7119194Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7119440Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.7119512Z Autotune Choices Stats: 2025-12-04T10:01:22.7120999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7121516Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7121846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7122405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7123586Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7124759Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7125932Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7127141Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7128317Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7129522Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7130716Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7131914Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7133084Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7134245Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7134497Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.7134627Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.7134701Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7134776Z unimplemented [] 2025-12-04T10:01:22.7134878Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7135090Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7136574Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7136654Z graph_break [] 2025-12-04T10:01:22.7136804Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7136880Z Autotune Choices Stats: 2025-12-04T10:01:22.7138408Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7138714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7138938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7139249Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7140386Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7141508Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7142640Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7143774Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7144961Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7146102Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7146342Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.7146449Z Autotune Choices Stats: 2025-12-04T10:01:22.7147985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7148459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7148793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7149365Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7150544Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7151718Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7152880Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7154086Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7155611Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7157000Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7158235Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7159402Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7160573Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7161733Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7161991Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.7162131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7162210Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7162273Z unimplemented [] 2025-12-04T10:01:22.7162428Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7162625Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7163827Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7163895Z graph_break [] 2025-12-04T10:01:22.7164030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7164134Z Autotune Choices Stats: 2025-12-04T10:01:22.7165600Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7165881Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7166105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7166426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7167575Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7168699Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.7169843Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7171012Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7172142Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7173278Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7173603Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.7173726Z Autotune Choices Stats: 2025-12-04T10:01:22.7175194Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7175652Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7175982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7176542Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7177733Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7178921Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7180125Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7181326Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7182567Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7183775Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7184945Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7186116Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7187337Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7188539Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7188799Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.7188934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7189013Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7189076Z unimplemented [] 2025-12-04T10:01:22.7189191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7189387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7190584Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7190722Z graph_break [] 2025-12-04T10:01:22.7190855Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7190956Z Autotune Choices Stats: 2025-12-04T10:01:22.7192376Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7192620Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7192847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7193163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7194306Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7195439Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7196584Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7197752Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7198892Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7200095Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7200372Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.7200441Z Autotune Choices Stats: 2025-12-04T10:01:22.7201899Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7202337Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7202678Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7203247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7204431Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7205664Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7206835Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7208049Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7209297Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7210479Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7211640Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7212815Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7213993Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7215190Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7215450Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.7215582Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7215653Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7215714Z unimplemented [] 2025-12-04T10:01:22.7215818Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7216044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7217262Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7217362Z graph_break [] 2025-12-04T10:01:22.7217501Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7217569Z Autotune Choices Stats: 2025-12-04T10:01:22.7218992Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7219237Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7219459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7219771Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7220920Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7222043Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7223227Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7224351Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7225552Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7226714Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7226962Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.7227030Z Autotune Choices Stats: 2025-12-04T10:01:22.7228528Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7228967Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7229307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7229867Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7231056Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7232273Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7233453Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7234878Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7236097Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7237270Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7238446Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7239617Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7240830Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7241996Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7242305Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.7242437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7242506Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7242615Z unimplemented [] 2025-12-04T10:01:22.7242719Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7242944Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7244140Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7244207Z graph_break [] 2025-12-04T10:01:22.7244341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7244412Z Autotune Choices Stats: 2025-12-04T10:01:22.7245848Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7246095Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7246326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7246641Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7247784Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7248944Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7250094Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7251253Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7252452Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7253589Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7253839Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.7253906Z Autotune Choices Stats: 2025-12-04T10:01:22.7255543Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7255990Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7256328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7256887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7258141Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7259314Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7260579Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7261799Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7262975Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7264169Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7265432Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7266877Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7268187Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7269388Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7269707Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.7269847Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7269915Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7269989Z unimplemented [] 2025-12-04T10:01:22.7270092Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7270281Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7271469Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7271537Z graph_break [] 2025-12-04T10:01:22.7271663Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7271728Z Autotune Choices Stats: 2025-12-04T10:01:22.7273147Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7273393Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7273617Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7273929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7275183Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7276522Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7277697Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7278886Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7280050Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7281176Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7281424Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.7281490Z Autotune Choices Stats: 2025-12-04T10:01:22.7282949Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7283397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7283736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7284336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7285539Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7286749Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7287987Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7289157Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7290319Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7291491Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7292654Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7293902Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7295080Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7296322Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7296603Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.7296732Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7296801Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7296868Z unimplemented [] 2025-12-04T10:01:22.7296969Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7297157Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7298343Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7298404Z graph_break [] 2025-12-04T10:01:22.7298535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7298601Z Autotune Choices Stats: 2025-12-04T10:01:22.7300023Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7300267Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7300490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7300801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7301980Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7303110Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7304300Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7305468Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7306600Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7307773Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7308028Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:22.7308094Z Autotune Choices Stats: 2025-12-04T10:01:22.7309549Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:22.7310027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7310364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7310923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7312143Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7313343Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7314546Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7315720Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7316886Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7318060Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7319262Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7320439Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7321679Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7322875Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7323125Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:22.7323304Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.7323384Z Traceback (most recent call last): 2025-12-04T10:01:22.7323688Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.7323754Z self.assertTrue( 2025-12-04T10:01:22.7323958Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.7324042Z raise self.failureException(msg) 2025-12-04T10:01:22.7324276Z AssertionError: False is not true : Log file /tmp/tmp_nrs4kuo/flex_attention_configs.json was not created 2025-12-04T10:01:22.7324282Z 2025-12-04T10:01:22.7324419Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.7324677Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.7324681Z 2025-12-04T10:01:22.7324852Z This message can be suppressed by setting 
PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.7324989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7325063Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7325130Z unimplemented [] 2025-12-04T10:01:22.7325240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7326453Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.7326689Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7326751Z graph_break [] 2025-12-04T10:01:22.7326885Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7327886Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.7327970Z current_size = base.storage().size() 2025-12-04T10:01:22.7328036Z Autotune Choices Stats: 2025-12-04T10:01:22.7329483Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.7329800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7330019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7330343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7331486Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7332618Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7333739Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7334858Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7336027Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7337170Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7337458Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.7337528Z Autotune Choices Stats: 2025-12-04T10:01:22.7339037Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.7339511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7339853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7340421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7341614Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7342785Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7344004Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7345174Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7346383Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7347621Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7348823Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7349992Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7351155Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7352323Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7352571Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.7352743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7352813Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7352877Z unimplemented [] 2025-12-04T10:01:22.7352984Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7353175Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7354387Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7354483Z graph_break [] 2025-12-04T10:01:22.7354619Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7354690Z Autotune Choices Stats: 2025-12-04T10:01:22.7356577Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7356950Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7357215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7357600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7358777Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7359913Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7361044Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7362223Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7363353Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7364513Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7364839Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.7364915Z Autotune Choices Stats: 2025-12-04T10:01:22.7366648Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7367123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7367455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7368017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7369195Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7370374Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7371583Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7372751Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7374010Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7375205Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7376370Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7377529Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7378691Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7379898Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7380152Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.7380289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7380358Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7380420Z unimplemented [] 2025-12-04T10:01:22.7380527Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7380712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7381950Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7382084Z graph_break [] 2025-12-04T10:01:22.7382215Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7382288Z Autotune Choices Stats: 2025-12-04T10:01:22.7383695Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7383954Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7384174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7384494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7385633Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7386770Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7388001Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7389155Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7390291Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7391488Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7391770Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.7391836Z Autotune Choices Stats: 2025-12-04T10:01:22.7393295Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7393735Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7394063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7394633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7395813Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7397016Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7398191Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7399422Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7400621Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7401780Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7402949Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7404123Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7405295Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7406521Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7406770Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.7406903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7406972Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7407067Z unimplemented [] 2025-12-04T10:01:22.7407173Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7407362Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7408587Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7408681Z graph_break [] 2025-12-04T10:01:22.7408810Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7408880Z Autotune Choices Stats: 2025-12-04T10:01:22.7410291Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7410542Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7410761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7411077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7412216Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7413357Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7414526Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7415672Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7416858Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.7418021Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7418275Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.7418341Z Autotune Choices Stats: 2025-12-04T10:01:22.7419789Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7420236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7420569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7421136Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7422351Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7423538Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7424751Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7426023Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7427191Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7428393Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7429561Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.7430730Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7431928Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7433098Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7433379Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.7433549Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7433619Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7433731Z unimplemented [] 
2025-12-04T10:01:22.7433841Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7434028Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7435224Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7435286Z graph_break [] 2025-12-04T10:01:22.7435418Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7435491Z Autotune Choices Stats: 2025-12-04T10:01:22.7436903Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7437154Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7437376Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7437691Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7438825Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7439997Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7441131Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7442330Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7443494Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7444636Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7444888Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.7444954Z Autotune Choices Stats: 2025-12-04T10:01:22.7446414Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7446872Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7447205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7447772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7448988Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7450168Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7451405Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7452604Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7453772Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7454937Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7456268Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7457506Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7458672Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7459893Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7460224Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.7460359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7460426Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7460487Z unimplemented [] 2025-12-04T10:01:22.7460591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7460775Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7461977Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7462038Z graph_break [] 2025-12-04T10:01:22.7462165Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7462235Z Autotune Choices Stats: 2025-12-04T10:01:22.7463652Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7463902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7464122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7464436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7465611Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7466739Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7467960Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7469187Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7470316Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7471450Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7471697Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.7471770Z Autotune Choices Stats: 2025-12-04T10:01:22.7473220Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7473666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7474037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7474605Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7475797Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7477047Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7478259Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7479429Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7480601Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7481772Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7482981Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7484563Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7485908Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7487189Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7487472Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.7487611Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7487683Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7487748Z unimplemented [] 2025-12-04T10:01:22.7487870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7488060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7489259Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7489320Z graph_break [] 2025-12-04T10:01:22.7489453Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7489527Z Autotune Choices Stats: 2025-12-04T10:01:22.7490947Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7491201Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7491420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7491741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7492919Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.7494100Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7495290Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7496468Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7497597Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7498764Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7499011Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.7499082Z Autotune Choices Stats: 2025-12-04T10:01:22.7500528Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7501016Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7501351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.7501916Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7503152Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7504395Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7505587Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7506757Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7508051Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7509230Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7510451Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7511680Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7512973Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7514185Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7514437Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.7514578Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.7514650Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7514712Z unimplemented [] 2025-12-04T10:01:22.7514821Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7515005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7516211Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7516284Z graph_break [] 2025-12-04T10:01:22.7516416Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7516489Z Autotune Choices Stats: 2025-12-04T10:01:22.7517936Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.7518224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7518443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7518792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7520339Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7522013Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7523191Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7524436Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7525648Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7526842Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7527096Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.7527175Z Autotune Choices Stats: 2025-12-04T10:01:22.7528790Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7529262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7529595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7530196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7531422Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7532633Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7533804Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7534974Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7536148Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7537369Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7538559Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7539802Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7541007Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7542184Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7542436Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.7542579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7542651Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7542714Z unimplemented [] 2025-12-04T10:01:22.7542825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7543019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7544216Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7544284Z graph_break [] 2025-12-04T10:01:22.7544414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7544486Z Autotune Choices Stats: 2025-12-04T10:01:22.7545952Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.7546210Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7546441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7546763Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7548035Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7549233Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7550368Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7551502Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7552636Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7553785Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7554035Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.7554113Z Autotune Choices Stats: 2025-12-04T10:01:22.7555822Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7556286Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7556729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7557341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7558532Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7559714Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7561043Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7562221Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7563448Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7564619Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7565838Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7567138Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7568315Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7569492Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7569743Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.7569888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7569958Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7570023Z unimplemented [] 2025-12-04T10:01:22.7570132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7570322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7571512Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7571578Z graph_break [] 2025-12-04T10:01:22.7571708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7571823Z Autotune Choices Stats: 2025-12-04T10:01:22.7573247Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7573500Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7573761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7574123Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7575277Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7576442Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7577581Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7578720Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7579979Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7581171Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7581424Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.7581496Z Autotune Choices Stats: 2025-12-04T10:01:22.7582951Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7583469Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7583833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7584397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7585579Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7586756Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7587969Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7589140Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7590373Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7591556Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7592800Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7594005Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7595236Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7596605Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7596860Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.7596998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7597076Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7597138Z unimplemented [] 2025-12-04T10:01:22.7597249Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7597437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7598678Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7598749Z graph_break [] 2025-12-04T10:01:22.7598879Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7598954Z Autotune Choices Stats: 2025-12-04T10:01:22.7600374Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.7600717Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7600972Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7601286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7602434Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7603566Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7604702Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7605837Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7606996Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7608137Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7608389Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.7608462Z Autotune Choices Stats: 2025-12-04T10:01:22.7609948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7610459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7610791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7611362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7612551Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7613740Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7614924Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7616152Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7617331Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7618557Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7619767Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7620940Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7622116Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.7623293Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7623543Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.7623677Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7623753Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7623821Z unimplemented [] 2025-12-04T10:01:22.7623931Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7624152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7625350Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7625419Z graph_break [] 2025-12-04T10:01:22.7625546Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7625618Z Autotune Choices Stats: 2025-12-04T10:01:22.7627119Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7627455Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7627678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7627990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7629145Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7630277Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7631552Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7632694Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7633873Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7635020Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7635318Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.7635437Z Autotune Choices Stats: 2025-12-04T10:01:22.7636904Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7637386Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7637725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7638287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7639476Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7640658Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7641870Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7643050Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.7644260Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7645493Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7646683Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7647855Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7649255Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7650449Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7650756Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.7650896Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7650971Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7651034Z unimplemented [] 2025-12-04T10:01:22.7651141Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7651330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7652523Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7652626Z graph_break [] 2025-12-04T10:01:22.7652790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7652863Z Autotune Choices Stats: 2025-12-04T10:01:22.7654317Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.7654571Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7654793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7655106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7656421Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7657568Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7658699Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7659905Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7661037Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7662266Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7662554Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.7662627Z Autotune Choices Stats: 2025-12-04T10:01:22.7664079Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7664524Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.7664855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7665420Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7666610Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7667878Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7669116Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7670299Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7671541Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7672743Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7673920Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7675097Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7676276Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7677487Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.7677735Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.7677866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7677940Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7678002Z unimplemented [] 2025-12-04T10:01:22.7678109Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7678293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7679553Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7679650Z graph_break [] 2025-12-04T10:01:22.7679778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7679847Z Autotune Choices Stats: 2025-12-04T10:01:22.7681261Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7681518Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7681738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7682052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7683209Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7684351Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7685518Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7686653Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7687813Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7689328Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7689574Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.7689648Z Autotune Choices Stats: 2025-12-04T10:01:22.7691104Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7691549Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7691883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7692448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7693635Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7694855Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7696029Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7697275Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7698495Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.7699664Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7700836Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7702005Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7703223Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7704399Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7704646Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.7704860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7704930Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7704992Z unimplemented [] 2025-12-04T10:01:22.7705103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7705331Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7706554Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7706619Z graph_break [] 2025-12-04T10:01:22.7706746Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.7706822Z Autotune Choices Stats: 2025-12-04T10:01:22.7708272Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7708524Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7708743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7709061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7710220Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7711399Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7712526Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7713666Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7714866Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7716046Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7716295Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.7716372Z Autotune Choices Stats: 2025-12-04T10:01:22.7717833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7718282Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7718613Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7719177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.7720402Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7721587Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7722840Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7724083Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7725269Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7726446Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7727627Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7728797Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7730012Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7731195Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7731531Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.7731699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7731767Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7731827Z unimplemented [] 2025-12-04T10:01:22.7731934Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7732118Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7733314Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7733382Z graph_break [] 2025-12-04T10:01:22.7733511Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7733584Z Autotune Choices Stats: 2025-12-04T10:01:22.7734994Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7735247Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7735468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7735789Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7736928Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7738098Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7739228Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7740427Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.7741591Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7742732Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7742978Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.7743051Z Autotune Choices Stats: 2025-12-04T10:01:22.7744512Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7744959Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7745290Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7745903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7747090Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7748359Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7749601Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7750775Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7751949Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7753113Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.7754290Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7755701Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7756897Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7758166Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7758456Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.7758595Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7758668Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7758730Z unimplemented [] 2025-12-04T10:01:22.7758841Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7759029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7760231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7760294Z graph_break [] 2025-12-04T10:01:22.7760423Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7760496Z Autotune Choices Stats: 2025-12-04T10:01:22.7761909Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7762167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7762388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7762709Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7763893Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7765037Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7777634Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7779140Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7780567Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7781982Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7782298Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.7782384Z Autotune Choices Stats: 
2025-12-04T10:01:22.7784221Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7784778Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7785253Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7785848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7787053Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.7788417Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7789619Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7790790Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7791962Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7793137Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7794336Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7795507Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7796711Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7797932Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7798188Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.7798325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7798396Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7798458Z unimplemented [] 2025-12-04T10:01:22.7798563Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7798756Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7799949Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7800012Z graph_break [] 2025-12-04T10:01:22.7800141Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7800207Z Autotune Choices Stats: 2025-12-04T10:01:22.7801637Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7801885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7802110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7802466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7803611Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7804773Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7805984Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7807116Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7808249Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.7809375Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7809621Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.7809689Z Autotune Choices Stats: 2025-12-04T10:01:22.7811195Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7811644Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7811976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7812539Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7813790Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7814992Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7816167Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7817347Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7818508Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7819703Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7820871Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.7822063Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7823262Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7824465Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7824723Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.7824854Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7824924Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7824985Z 
unimplemented [] 2025-12-04T10:01:22.7825091Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7825280Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7826487Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7826560Z graph_break [] 2025-12-04T10:01:22.7826691Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7826756Z Autotune Choices Stats: 2025-12-04T10:01:22.7828259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7828511Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7828735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7829052Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7830190Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7831379Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7832536Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7833667Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7834795Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7835929Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7836180Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.7836249Z Autotune Choices Stats: 2025-12-04T10:01:22.7837738Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7838180Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7838512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7839151Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7840387Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7841564Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.7842732Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7843903Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7845070Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7846272Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7847437Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7848665Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7849866Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7851186Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7851448Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.7851578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7851650Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7851710Z unimplemented [] 2025-12-04T10:01:22.7851812Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7852000Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7853193Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7853254Z graph_break [] 2025-12-04T10:01:22.7853378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7853441Z Autotune Choices Stats: 2025-12-04T10:01:22.7854961Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7855398Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7855630Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7855947Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7857223Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7858398Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7859535Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7860683Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7861817Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7862947Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7863244Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.7863313Z Autotune Choices Stats: 2025-12-04T10:01:22.7864762Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.7865236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7865601Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7866197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7867432Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7868601Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7869768Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7870938Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7872139Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7873327Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7874580Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7875928Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.7877315Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7878554Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7878805Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.7878941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7879011Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7879072Z unimplemented [] 2025-12-04T10:01:22.7879176Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7879367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7880577Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7880681Z graph_break [] 2025-12-04T10:01:22.7880815Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7880880Z Autotune Choices Stats: 2025-12-04T10:01:22.7882304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7882582Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7882840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7883161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7884344Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7885480Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7886608Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7887739Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7888876Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7890049Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7890297Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.7890360Z Autotune Choices Stats: 2025-12-04T10:01:22.7891844Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7892347Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7892678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7893251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7894440Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7895628Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7896799Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7898006Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7899188Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7900389Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7901626Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7902793Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7903965Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7905135Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7905395Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.7905522Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.7905590Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7905652Z unimplemented [] 2025-12-04T10:01:22.7905753Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7905939Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7907185Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7907285Z graph_break [] 2025-12-04T10:01:22.7907413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7907480Z Autotune Choices Stats: 2025-12-04T10:01:22.7908947Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7909255Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7909473Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7909804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7910953Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7912074Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7913205Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7914336Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7915513Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7916642Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7916894Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:22.7917003Z Autotune Choices Stats: 2025-12-04T10:01:22.7918496Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:22.7918965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7919299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7919869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7921056Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7922227Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7923404Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7924604Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7925792Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7927028Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7928232Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7929407Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7930584Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7931755Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.7932005Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:22.7932135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7932204Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7932299Z unimplemented [] 2025-12-04T10:01:22.7932403Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7932588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7933795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7933856Z graph_break [] 2025-12-04T10:01:22.7934019Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7934083Z Autotune Choices Stats: 2025-12-04T10:01:22.7935541Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7935818Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7936037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7936356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7937513Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7938641Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.7939773Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7940954Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7942085Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7943251Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7943529Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:22.7943625Z Autotune Choices Stats: 2025-12-04T10:01:22.7945076Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.7945515Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7945844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7946418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7947644Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7948826Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7950041Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7951212Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7952442Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7953642Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7954960Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7956313Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7957489Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7958736Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7958996Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:22.7959187Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.7959269Z Traceback (most recent call last): 2025-12-04T10:01:22.7959571Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.7959633Z self.assertTrue( 2025-12-04T10:01:22.7959836Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.7959914Z raise self.failureException(msg) 2025-12-04T10:01:22.7960157Z AssertionError: False is not true : Log 
file /tmp/tmpicavptze/flex_attention_configs.json was not created
2025-12-04T10:01:22.7960217Z
2025-12-04T10:01:22.7960351Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:22.7960649Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:22.7960654Z
2025-12-04T10:01:22.7960862Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:22.7960994Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:22.7961106Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:22.7961173Z unimplemented []
2025-12-04T10:01:22.7961302Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:22.7962517Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:22.7962711Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:22.7962771Z graph_break []
2025-12-04T10:01:22.7962907Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:22.7963968Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.7964050Z current_size = base.storage().size() 2025-12-04T10:01:22.7964117Z Autotune Choices Stats: 2025-12-04T10:01:22.7965552Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.7965802Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7966022Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7966346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7967571Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7968738Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7969942Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7971147Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7972287Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7973424Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7973680Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.7973748Z Autotune Choices Stats: 2025-12-04T10:01:22.7975215Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.7975693Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7976040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7976620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7977840Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7979075Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7980244Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7981452Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7982627Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7983802Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7985007Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.7986183Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.7987520Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7988767Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.7989021Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.7989154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.7989225Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.7989292Z unimplemented [] 2025-12-04T10:01:22.7989394Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.7989579Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.7990770Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.7990833Z graph_break [] 2025-12-04T10:01:22.7990966Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.7991033Z Autotune Choices Stats: 2025-12-04T10:01:22.7992456Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.7992703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.7992975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.7993294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.7994454Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7995700Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.7996878Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.7998026Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.7999167Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8000299Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8000553Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.8000620Z Autotune Choices Stats: 2025-12-04T10:01:22.8002111Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8002551Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8002889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8003455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8004739Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8005960Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8007141Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8008319Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8009525Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8010735Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8011913Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8013119Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8014407Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8015575Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8015831Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.8015960Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8016029Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8016096Z unimplemented [] 2025-12-04T10:01:22.8016199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8016403Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8017608Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8017670Z graph_break [] 2025-12-04T10:01:22.8017803Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8017870Z Autotune Choices Stats: 2025-12-04T10:01:22.8019326Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8019579Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8019803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8020119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8021306Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8022877Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8024447Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8025615Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8026781Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8028030Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8028291Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.8028364Z Autotune Choices Stats: 2025-12-04T10:01:22.8029907Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8030355Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8030735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8031342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8032580Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8033763Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8034983Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8036168Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8037342Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8038563Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8039734Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8040988Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8042203Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8043372Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8043631Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.8043768Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8043839Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8043910Z unimplemented [] 2025-12-04T10:01:22.8044017Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8044211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8045428Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8045489Z graph_break [] 2025-12-04T10:01:22.8045627Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8045694Z Autotune Choices Stats: 2025-12-04T10:01:22.8047158Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8047410Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8047638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8047994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8049183Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8050349Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8051498Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8052633Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8053773Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.8054946Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8055456Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.8055553Z Autotune Choices Stats: 2025-12-04T10:01:22.8057029Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8057635Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8058021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8058592Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8059787Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8060961Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8062144Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8063337Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8064699Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8065931Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8067187Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.8068449Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8069619Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8070790Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8071052Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.8071191Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8071266Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8071337Z unimplemented [] 
2025-12-04T10:01:22.8071444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8071636Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8072901Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8072969Z graph_break [] 2025-12-04T10:01:22.8073110Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8073180Z Autotune Choices Stats: 2025-12-04T10:01:22.8074606Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8074893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8075201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8075626Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8076983Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8078130Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8079268Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8080414Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8081549Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8082719Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8082973Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.8083038Z Autotune Choices Stats: 2025-12-04T10:01:22.8084532Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8085038Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8085374Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8085944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8087139Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8088317Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8089501Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8090711Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8091878Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8093078Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8094312Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8095495Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8096665Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8097844Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8098097Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.8098232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8098303Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8098371Z unimplemented [] 2025-12-04T10:01:22.8098474Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8098665Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8099909Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8099974Z graph_break [] 2025-12-04T10:01:22.8100107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8100172Z Autotune Choices Stats: 2025-12-04T10:01:22.8101636Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8101958Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8102186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8102504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8103649Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8104798Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8105953Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8107087Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8108326Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8109465Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8109750Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.8109819Z Autotune Choices Stats: 2025-12-04T10:01:22.8111317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8111790Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8112133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8112701Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8113894Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8115078Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8116306Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8117486Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8118721Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8119936Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8121144Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8122334Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8123506Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8124687Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8124941Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.8125077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8125182Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8125254Z unimplemented [] 2025-12-04T10:01:22.8125358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8125549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8126767Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8126862Z graph_break [] 2025-12-04T10:01:22.8126996Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8127066Z Autotune Choices Stats: 2025-12-04T10:01:22.8128584Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8128864Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8129088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8129420Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8130569Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.8131728Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8132874Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8134048Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8135188Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8136367Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8136698Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.8136765Z Autotune Choices Stats: 2025-12-04T10:01:22.8138228Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8138669Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8139009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.8139572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8140765Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8141951Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8143204Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8144388Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8145642Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8146847Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8148069Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8149250Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8150414Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8151618Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8151877Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.8152006Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.8152074Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8152141Z unimplemented [] 2025-12-04T10:01:22.8152242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8152430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8153664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8154170Z graph_break [] 2025-12-04T10:01:22.8154309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8154376Z Autotune Choices Stats: 2025-12-04T10:01:22.8156021Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.8156271Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8156495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8156818Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8157968Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8159104Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8160308Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8161443Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8162586Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8163821Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8164133Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.8164202Z Autotune Choices Stats: 2025-12-04T10:01:22.8165678Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8166127Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8166468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8167037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8168320Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8169541Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8170728Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8171985Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8173197Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8174370Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8175551Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8176728Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8177897Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8179105Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8179354Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.8179495Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8179563Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8179665Z unimplemented [] 2025-12-04T10:01:22.8179770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8179959Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8181193Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8181288Z graph_break [] 2025-12-04T10:01:22.8181424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8181490Z Autotune Choices Stats: 2025-12-04T10:01:22.8182914Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.8183159Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8183379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8183702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8184854Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8186648Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8188546Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8190492Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8191775Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8192951Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8193220Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.8193291Z Autotune Choices Stats: 2025-12-04T10:01:22.8194758Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8195201Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8195549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8196120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8197351Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8198523Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8199735Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8200978Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8202157Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8203329Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8204513Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8205696Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8206927Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8208106Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8208390Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.8208571Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8208647Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8208748Z unimplemented [] 2025-12-04T10:01:22.8208852Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8209042Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8210241Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8210305Z graph_break [] 2025-12-04T10:01:22.8210446Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8210515Z Autotune Choices Stats: 2025-12-04T10:01:22.8211935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8212189Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8212413Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8212752Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8213911Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8215091Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8216231Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8217434Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8218604Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8219743Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8219994Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.8220060Z Autotune Choices Stats: 2025-12-04T10:01:22.8221525Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8221964Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8222299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8222860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8224090Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8225271Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8226515Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8227808Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8228982Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8230163Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8231335Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8232549Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8233728Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8234937Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8235287Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.8235424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8235495Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8235562Z unimplemented [] 2025-12-04T10:01:22.8235665Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8235854Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8237059Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8237122Z graph_break [] 2025-12-04T10:01:22.8237259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8237326Z Autotune Choices Stats: 2025-12-04T10:01:22.8238749Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.8239002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8239223Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8239559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8240750Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8241895Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8243064Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8244260Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8245411Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8246547Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8246798Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.8246866Z Autotune Choices Stats: 2025-12-04T10:01:22.8248328Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8248770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8249143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8249713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8251253Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8253401Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8255148Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8256608Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8257810Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8258994Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8260259Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8261444Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8262668Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.8263935Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8264237Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.8264385Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8264460Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8264524Z unimplemented [] 2025-12-04T10:01:22.8264642Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8264836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8266040Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8266104Z graph_break [] 2025-12-04T10:01:22.8266247Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8266318Z Autotune Choices Stats: 2025-12-04T10:01:22.8267820Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8268078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8268300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8268672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8269839Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8270979Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8272247Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8273425Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8274568Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8275717Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8275973Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.8276044Z Autotune Choices Stats: 2025-12-04T10:01:22.8277508Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8277985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8278328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8278891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8280124Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8281368Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8282560Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8283736Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.8284922Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8286094Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8287304Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8288480Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8289709Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8290917Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8291172Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.8291309Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8291378Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8291441Z unimplemented [] 2025-12-04T10:01:22.8291551Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8291739Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8292949Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8293014Z graph_break [] 2025-12-04T10:01:22.8293153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8293219Z Autotune Choices Stats: 2025-12-04T10:01:22.8294650Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.8294942Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8295166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8295490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8296647Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8297855Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8299014Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8300155Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8301295Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8302428Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8302679Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.8302746Z Autotune Choices Stats: 2025-12-04T10:01:22.8304248Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8304687Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.8305027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8305655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8306875Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8308205Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8309404Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8310589Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8311764Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8312974Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8314242Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8316428Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8318527Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8320461Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.8320740Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.8320889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8320968Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8321035Z unimplemented [] 2025-12-04T10:01:22.8321149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8321348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8322557Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8322623Z graph_break [] 2025-12-04T10:01:22.8322760Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8322837Z Autotune Choices Stats: 2025-12-04T10:01:22.8324341Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8324612Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8324835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8325167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8326401Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8327584Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8328739Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8329895Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8331052Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8332199Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8332476Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.8332584Z Autotune Choices Stats: 2025-12-04T10:01:22.8334072Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8334525Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8334928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8335551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8336771Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8337979Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8339189Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8340379Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8341628Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.8342816Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8344047Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8345310Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8346503Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8347818Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8348080Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.8348224Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8348298Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8348362Z unimplemented [] 2025-12-04T10:01:22.8348478Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8348672Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8349904Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8349968Z graph_break [] 2025-12-04T10:01:22.8350141Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.8350218Z Autotune Choices Stats: 2025-12-04T10:01:22.8351666Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8351924Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8352186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8352557Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8353753Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8354908Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8356403Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8357594Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8358751Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8360004Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8360281Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.8360354Z Autotune Choices Stats: 2025-12-04T10:01:22.8361846Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8362400Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8362797Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8363371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.8364586Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8365784Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8366994Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8368196Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8369442Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8370647Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8371907Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8373165Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8374366Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8375572Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8375835Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.8375987Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8376062Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8376130Z unimplemented [] 2025-12-04T10:01:22.8376253Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8376450Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8377737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8377806Z graph_break [] 2025-12-04T10:01:22.8377939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8378019Z Autotune Choices Stats: 2025-12-04T10:01:22.8379473Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8379800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8380060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8380393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8381575Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8382740Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8383898Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8385064Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.8386266Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8387551Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8387822Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.8387894Z Autotune Choices Stats: 2025-12-04T10:01:22.8389459Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8389947Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8390285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8390876Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8392093Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8393297Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8394506Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8395738Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8396938Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8398196Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.8399435Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8400641Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8401842Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8403046Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8403306Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.8403449Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8403522Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8403593Z unimplemented [] 2025-12-04T10:01:22.8403714Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8403947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8405917Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8406036Z graph_break [] 2025-12-04T10:01:22.8406264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8406464Z Autotune Choices Stats: 2025-12-04T10:01:22.8409042Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8409529Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8409914Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8410494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8412531Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8414549Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8416580Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8418546Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8420588Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8422592Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8423145Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.8423257Z Autotune Choices Stats: 
2025-12-04T10:01:22.8425741Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8426526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8427108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8428182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8430163Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.8431384Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8432663Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8433864Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8435099Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8436357Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8437549Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8438739Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8439924Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8441110Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8441421Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.8441576Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8441653Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8441719Z unimplemented [] 2025-12-04T10:01:22.8441837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8442034Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8443261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8443361Z graph_break [] 2025-12-04T10:01:22.8443535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8443644Z Autotune Choices Stats: 2025-12-04T10:01:22.8445113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8445425Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8445693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8446086Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8447322Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8448474Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8449617Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8450798Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8451933Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.8453173Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8453459Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.8453539Z Autotune Choices Stats: 2025-12-04T10:01:22.8455009Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8455716Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8456056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8456644Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8457850Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8459135Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8460338Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8461574Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8462810Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8464044Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8465363Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.8466557Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8467802Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8469032Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8469295Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.8469443Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8469517Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8469584Z 
unimplemented [] 2025-12-04T10:01:22.8469699Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8469894Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8471193Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8471290Z graph_break [] 2025-12-04T10:01:22.8471422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8471499Z Autotune Choices Stats: 2025-12-04T10:01:22.8472930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8473187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8473411Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8473746Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8474908Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8476056Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8477231Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8478384Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8479561Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8480772Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8481022Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.8481093Z Autotune Choices Stats: 2025-12-04T10:01:22.8482563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8483015Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8483355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8483938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8485943Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8487332Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.8488543Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8489803Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8491020Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8492200Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8493378Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8494552Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8495780Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8496962Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8497221Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.8497406Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8497477Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8497546Z unimplemented [] 2025-12-04T10:01:22.8497693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8497887Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8499137Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8499204Z graph_break [] 2025-12-04T10:01:22.8499341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8499415Z Autotune Choices Stats: 2025-12-04T10:01:22.8500850Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8501111Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8501335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8501672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8502830Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8504022Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8505171Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8506354Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8507602Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8508787Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8509040Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.8509118Z Autotune Choices Stats: 2025-12-04T10:01:22.8510575Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8511031Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8511380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8511964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8513198Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8514514Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8516009Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8517270Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8518458Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8519636Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8520839Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8522052Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.8523240Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8524433Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8524754Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.8524933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8525004Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8525076Z unimplemented [] 2025-12-04T10:01:22.8525194Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8525386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8526601Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8526671Z graph_break [] 2025-12-04T10:01:22.8526808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8526885Z Autotune Choices Stats: 2025-12-04T10:01:22.8528313Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8528574Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8528796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8529135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8530295Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8531489Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8532639Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8533865Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8535043Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8536200Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8536453Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.8536530Z Autotune Choices Stats: 2025-12-04T10:01:22.8538003Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8538454Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8538788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8539408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8540606Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8541829Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8543089Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8544287Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8545483Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8546680Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8547936Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8549163Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8550351Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8551629Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8551913Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.8552049Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.8552125Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8552188Z unimplemented [] 2025-12-04T10:01:22.8552306Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8552496Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8553727Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8553794Z graph_break [] 2025-12-04T10:01:22.8553924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8553996Z Autotune Choices Stats: 2025-12-04T10:01:22.8555658Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8555923Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8556146Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8556476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8557722Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8558878Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8560120Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8561374Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8562529Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8563683Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8563939Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:22.8564014Z Autotune Choices Stats: 2025-12-04T10:01:22.8565487Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:22.8565946Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8566318Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8566905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8568111Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8569382Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8570599Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8571787Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8572980Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8574162Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8575388Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8576565Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8577810Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8579022Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8579279Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:22.8579414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8579493Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8579559Z unimplemented [] 2025-12-04T10:01:22.8579672Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8579860Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8581075Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8581143Z graph_break [] 2025-12-04T10:01:22.8581277Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8581358Z Autotune Choices Stats: 2025-12-04T10:01:22.8582785Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8583041Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8583301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8583630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8584798Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8586004Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.8587300Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8588459Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8589605Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8590765Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8591017Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:22.8591093Z Autotune Choices Stats: 2025-12-04T10:01:22.8592609Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8593063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8593399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8593983Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8595253Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8596505Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8597700Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8598903Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8600098Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8601323Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8602523Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8603749Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8605022Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8606221Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8606477Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:22.8606610Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8606688Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8606754Z unimplemented [] 2025-12-04T10:01:22.8606866Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8607056Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8608283Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8608352Z graph_break [] 2025-12-04T10:01:22.8608483Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8608557Z Autotune Choices Stats: 2025-12-04T10:01:22.8610039Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8610303Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8610533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8610857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8612069Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8613253Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8614455Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8615629Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8616792Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8617955Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8618209Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:22.8618284Z Autotune Choices Stats: 2025-12-04T10:01:22.8619973Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8620699Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8621210Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8622270Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8624390Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8626511Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8628702Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8630795Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8632853Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8635103Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8637254Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8640044Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8642130Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8644216Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8644703Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:22.8645029Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.8645181Z Traceback (most recent call last): 2025-12-04T10:01:22.8645716Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.8645845Z self.assertTrue( 2025-12-04T10:01:22.8646179Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.8646317Z raise self.failureException(msg) 2025-12-04T10:01:22.8646739Z AssertionError: False is not true : Log file /tmp/tmpzc1j4inl/flex_attention_configs.json was not created 2025-12-04T10:01:22.8646746Z 2025-12-04T10:01:22.8646975Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.8647433Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.8647447Z 2025-12-04T10:01:22.8647733Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.8647970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8648103Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8648212Z unimplemented [] 2025-12-04T10:01:22.8648489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8650681Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.8651028Z 
aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8651144Z graph_break [] 2025-12-04T10:01:22.8651442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8653223Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.8653425Z current_size = base.storage().size() 2025-12-04T10:01:22.8653535Z Autotune Choices Stats: 2025-12-04T10:01:22.8656126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.8656612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8657021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8657601Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8659651Z triton_flex_attention_4 0.0094 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8661678Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8663815Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8665852Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.8667986Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8670221Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8670786Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.8670907Z Autotune Choices Stats: 2025-12-04T10:01:22.8673456Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.8674261Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8674854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8675849Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8677955Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8680101Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8682172Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8684356Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8686557Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8688676Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:22.8690773Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8692859Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8694994Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8697071Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8697525Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.8697754Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8697881Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8698010Z unimplemented [] 2025-12-04T10:01:22.8698302Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8698657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8700922Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8701112Z graph_break [] 2025-12-04T10:01:22.8701361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8701481Z Autotune Choices Stats: 2025-12-04T10:01:22.8703973Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8704452Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8704863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8705436Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8707639Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8709695Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8711840Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8713890Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8716133Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8718241Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8718731Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.8718864Z Autotune Choices Stats: 2025-12-04T10:01:22.8721398Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8722080Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8722441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8723011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8724306Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8725510Z 
triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8726737Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8727957Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8729145Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8730311Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8731474Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8732647Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8733854Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8735030Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8735327Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.8735473Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8735581Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8735652Z unimplemented [] 2025-12-04T10:01:22.8735797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8735994Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8737195Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8737259Z 
graph_break [] 2025-12-04T10:01:22.8737403Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8737475Z Autotune Choices Stats: 2025-12-04T10:01:22.8738907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8739161Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8739398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8739719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8740877Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8742044Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8743184Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8744345Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8745573Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8746705Z triton_flex_attention_39 0.0133 
ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8746976Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:22.8747044Z Autotune Choices Stats: 2025-12-04T10:01:22.8748619Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8749062Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8749402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8749964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8751205Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8752376Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8753629Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8754832Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8756261Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8757449Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8758622Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8759883Z triton_flex_attention_backward_51 0.0133 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8761061Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8762305Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8762657Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.8762800Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8762873Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8762945Z unimplemented [] 2025-12-04T10:01:22.8763052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T10:01:22.8763250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8764454Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8764519Z graph_break [] 2025-12-04T10:01:22.8764660Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8764728Z Autotune Choices Stats: 2025-12-04T10:01:22.8766154Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8766404Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8766634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8766949Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.8768142Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8769280Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8770447Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8771615Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8772775Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8773896Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8774152Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.8774218Z Autotune Choices Stats: 2025-12-04T10:01:22.8775690Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:22.8776134Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8776469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8777067Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8778264Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8779474Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8780708Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8781881Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8783044Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8784215Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8785375Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8786574Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8787845Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8789074Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8789364Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.8789501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8789575Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8789644Z unimplemented [] 2025-12-04T10:01:22.8789749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8789940Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8791146Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8791210Z graph_break [] 2025-12-04T10:01:22.8791347Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8791415Z Autotune Choices Stats: 2025-12-04T10:01:22.8792838Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8793090Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8793322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8793639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8794855Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8795994Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8797197Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8798353Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8799484Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8800605Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8800857Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.8800925Z Autotune Choices Stats: 
2025-12-04T10:01:22.8802387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0}
2025-12-04T10:01:22.8802859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1)
2025-12-04T10:01:22.8803200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1]
2025-12-04T10:01:22.8803761Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32
2025-12-04T10:01:22.8804986Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:22.8806182Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8807384Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8808562Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8809726Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8810892Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8820503Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8822763Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8824576Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8825817Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8826087Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.8826235Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8826315Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8826394Z unimplemented [] 2025-12-04T10:01:22.8826505Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8826697Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8828008Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8828078Z graph_break [] 2025-12-04T10:01:22.8828228Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8828299Z Autotune Choices Stats: 2025-12-04T10:01:22.8829723Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8829983Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8830257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8830594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8831749Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8832944Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8834144Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8835265Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8836384Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.8837502Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8837758Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.8837827Z Autotune Choices Stats: 2025-12-04T10:01:22.8839312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8839766Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8840104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8840661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8841911Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8843129Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8844294Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8845461Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8846635Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8847826Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8848981Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.8850178Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4
2025-12-04T10:01:22.8851394Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:22.8852558Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:22.8852812Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices
2025-12-04T10:01:22.8852955Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:22.8853027Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:22.8853096Z unimplemented []
2025-12-04T10:01:22.8853204Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8853391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8854597Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8854662Z graph_break [] 2025-12-04T10:01:22.8854799Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8854868Z Autotune Choices Stats: 2025-12-04T10:01:22.8856626Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8856893Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8857118Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8857435Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8858627Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8859842Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8860977Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8862098Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8863223Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8864345Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8864614Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.8864686Z Autotune Choices Stats: 2025-12-04T10:01:22.8866173Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.8866618Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8866992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8867671Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8868885Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8870058Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8871233Z 
triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8872395Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8873587Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8874745Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8875904Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8877441Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8879268Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8880732Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8880991Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.8881141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8881218Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8881291Z unimplemented [] 2025-12-04T10:01:22.8881403Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8881600Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8882822Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8882888Z graph_break [] 2025-12-04T10:01:22.8883035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8883106Z Autotune Choices Stats: 2025-12-04T10:01:22.8884596Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.8884864Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8885091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8885453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8886634Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8887801Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8888935Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8890068Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8891197Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8892355Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8892618Z SingleProcess AUTOTUNE 
benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.8892693Z Autotune Choices Stats: 2025-12-04T10:01:22.8894141Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8894652Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8895028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8895582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8896770Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8897965Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8899147Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8900303Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8901501Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8902661Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8903874Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8905068Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8906230Z 
triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8907496Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8907754Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.8907899Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8907971Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8908044Z unimplemented [] 2025-12-04T10:01:22.8908149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8908338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8909586Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8909651Z graph_break [] 2025-12-04T10:01:22.8909789Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8909856Z Autotune Choices Stats: 2025-12-04T10:01:22.8911265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.8911578Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8911838Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8912194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8913339Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:22.8914463Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8915583Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8916714Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8917839Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8918995Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8919255Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.8919325Z Autotune Choices Stats: 2025-12-04T10:01:22.8920842Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8921359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8921700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1] 2025-12-04T10:01:22.8922271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8923466Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8924639Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8925803Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8926995Z triton_flex_attention_backward_161 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8928159Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8929357Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8930590Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8931765Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8932930Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8934107Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8934362Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.8934506Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.8934576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8934640Z unimplemented [] 2025-12-04T10:01:22.8934752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8934942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8936187Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8936253Z graph_break [] 2025-12-04T10:01:22.8936389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8936456Z Autotune Choices Stats: 2025-12-04T10:01:22.8938862Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8939413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8939778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8940329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8941941Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8943073Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8944214Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8945350Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8946576Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8947786Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8948087Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.8948160Z Autotune Choices Stats: 2025-12-04T10:01:22.8949647Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8950125Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8950473Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8951042Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8952229Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8953401Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8954606Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8956030Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8957296Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8958554Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8959723Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8960887Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8962053Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8963227Z triton_flex_attention_backward_189 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8963480Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:22.8963676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8963754Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8963821Z unimplemented [] 2025-12-04T10:01:22.8963936Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8964136Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8965347Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8965446Z graph_break [] 2025-12-04T10:01:22.8965585Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8965656Z Autotune Choices Stats: 2025-12-04T10:01:22.8967102Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.8967392Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8967616Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8967939Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8969076Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8970203Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8971339Z triton_flex_attention_193 
0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.8972509Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.8973646Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8974805Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8975180Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.8975249Z Autotune Choices Stats: 2025-12-04T10:01:22.8976703Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.8977148Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8977490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8978051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8979244Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8980412Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8981617Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8982780Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8984011Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8985205Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8986385Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.8987605Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.8988766Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.8989972Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.8990221Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.8990364Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.8990435Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.8990497Z unimplemented [] 2025-12-04T10:01:22.8990607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.8990793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.8992035Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.8992160Z graph_break [] 2025-12-04T10:01:22.8992290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.8992375Z Autotune Choices Stats: 2025-12-04T10:01:22.8993784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.8994038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.8994265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.8994598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.8996498Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8997813Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.8999399Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9001146Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9003000Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9004807Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9005339Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.9005479Z Autotune Choices Stats: 2025-12-04T10:01:22.9008053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9008806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9009147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9009728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9010925Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9012162Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9013347Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9014603Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9015813Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9016983Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9018147Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9019321Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9020527Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9021689Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9021945Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds 
precompiling for 13 choices 2025-12-04T10:01:22.9022089Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9022199Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9022265Z unimplemented [] 2025-12-04T10:01:22.9022382Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9022608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9023813Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9023911Z graph_break [] 2025-12-04T10:01:22.9024049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9024127Z Autotune Choices Stats: 2025-12-04T10:01:22.9025558Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.9025821Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.9026047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9026378Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9027624Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9028768Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9029941Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9031086Z triton_flex_attention_230 0.0113 ms 81.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9032287Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9033443Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9033700Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.9033780Z Autotune Choices Stats: 2025-12-04T10:01:22.9035240Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9035691Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9036023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9036592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9037808Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9038983Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9040188Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9041405Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9042580Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9043745Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9044914Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9046086Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9047288Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9048460Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9048760Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.9048934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9049037Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9049101Z unimplemented [] 2025-12-04T10:01:22.9049212Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9049400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9050607Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9050672Z graph_break [] 2025-12-04T10:01:22.9050804Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9050881Z Autotune Choices Stats: 2025-12-04T10:01:22.9052298Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9052552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9052777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9053106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9054250Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9055690Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9056847Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9058418Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9060017Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9061383Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9061816Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.9061932Z Autotune Choices Stats: 2025-12-04T10:01:22.9063531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9064280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9064647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9065216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9067081Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9068344Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9069598Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9070801Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9071965Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9073134Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9074306Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9075517Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9076687Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9077917Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9078203Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.9078346Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9078418Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9078482Z unimplemented [] 2025-12-04T10:01:22.9078599Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9078789Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9079994Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9080060Z graph_break [] 2025-12-04T10:01:22.9080195Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9080271Z Autotune Choices Stats: 2025-12-04T10:01:22.9081685Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9081946Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9082173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9082501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9083708Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9084843Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9086007Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9087202Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9088329Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9089459Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9089708Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.9089783Z Autotune Choices Stats: 2025-12-04T10:01:22.9091232Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9091689Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9092057Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9092628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9093815Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9095053Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9096251Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9097420Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9098594Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9099756Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9100964Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9102132Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9103346Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9104605Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9104860Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.9105005Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9105081Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9105145Z unimplemented [] 2025-12-04T10:01:22.9105263Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9105459Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9106664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9106728Z graph_break [] 2025-12-04T10:01:22.9106864Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9106939Z Autotune Choices Stats: 2025-12-04T10:01:22.9108443Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:22.9108703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9108927Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9109294Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9110444Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9111608Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9112756Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9113936Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9115064Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9116198Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9116449Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.9116522Z Autotune Choices Stats: 2025-12-04T10:01:22.9117967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9118468Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9118809Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9119373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9120632Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9121869Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9123047Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9124213Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9125391Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9126564Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9127772Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9128930Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9130160Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9131373Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9131624Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.9131765Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9131835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9131897Z unimplemented [] 2025-12-04T10:01:22.9132008Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9132197Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9133400Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9133474Z graph_break [] 2025-12-04T10:01:22.9133603Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.9133678Z Autotune Choices Stats: 2025-12-04T10:01:22.9135086Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9135379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9135611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9135935Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9137066Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9138268Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9139430Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9140558Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9141684Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9142817Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9143067Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.9143141Z Autotune Choices Stats: 2025-12-04T10:01:22.9144619Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9145070Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9145411Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9146384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.9147636Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9148856Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9150029Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9151205Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9152373Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9153573Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9154744Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9156272Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9157519Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9158705Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9158962Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.9159103Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9159178Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9159244Z unimplemented [] 2025-12-04T10:01:22.9159352Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9159540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9160737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9160804Z graph_break [] 2025-12-04T10:01:22.9160934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9161006Z Autotune Choices Stats: 2025-12-04T10:01:22.9162489Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9162753Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9162975Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9163298Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9164534Z triton_flex_attention_327 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9165705Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9166850Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9167976Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:22.9169102Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9170237Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9170525Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.9170604Z Autotune Choices Stats: 2025-12-04T10:01:22.9172051Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9172499Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9172897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9173498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9174680Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9175863Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9177034Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9178205Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9179410Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9180578Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.9181798Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9183019Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9184181Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9185335Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9185587Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.9185719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9185797Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9185862Z unimplemented [] 2025-12-04T10:01:22.9185969Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9186156Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9187407Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9187478Z graph_break [] 2025-12-04T10:01:22.9187645Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9187720Z Autotune Choices Stats: 2025-12-04T10:01:22.9189126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9189380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9189638Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9189986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9191166Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9192299Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9193424Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9194556Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9195684Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9196852Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9197101Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.9197176Z Autotune Choices Stats: 
2025-12-04T10:01:22.9198650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9199128Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9199494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9200059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9201239Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.9202416Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9203590Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9204798Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9205964Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9207158Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9208354Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9209552Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9210724Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9211886Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9212135Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.9212263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9212353Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9212416Z unimplemented [] 2025-12-04T10:01:22.9212531Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9212715Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9213966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9214034Z graph_break [] 2025-12-04T10:01:22.9214162Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9214233Z Autotune Choices Stats: 2025-12-04T10:01:22.9215673Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9215956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9216237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9216554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9217700Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9218826Z triton_flex_attention_366 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9219956Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9221091Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9222250Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.9223381Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9223624Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.9223729Z Autotune Choices Stats: 2025-12-04T10:01:22.9225203Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9225679Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9226009Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9226573Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9227798Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9228975Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9230140Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9231344Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9232506Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9233731Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9234934Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.9236100Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9237265Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9238428Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9238673Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.9238804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9238880Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9238944Z 
unimplemented [] 2025-12-04T10:01:22.9239084Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9239273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9240466Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9240530Z graph_break [] 2025-12-04T10:01:22.9240659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9240764Z Autotune Choices Stats: 2025-12-04T10:01:22.9242196Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9242494Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9242713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9243026Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9244184Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9245316Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9246451Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9247613Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9248739Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9249871Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9250178Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.9250285Z Autotune Choices Stats: 2025-12-04T10:01:22.9251723Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9252173Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9252506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9253071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9254254Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9255728Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:22.9257021Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9258201Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9259482Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9260681Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9261843Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9263000Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9264164Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9265331Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9265623Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.9265758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9265835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9265898Z unimplemented [] 2025-12-04T10:01:22.9266001Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9266195Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9267460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9267604Z graph_break [] 2025-12-04T10:01:22.9267738Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9267844Z Autotune Choices Stats: 2025-12-04T10:01:22.9269250Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9269504Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9269728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9270043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9271186Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9272325Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9273457Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9274625Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9275759Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9276966Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9277248Z 
SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:22.9277322Z Autotune Choices Stats: 2025-12-04T10:01:22.9278767Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:22.9279215Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9279550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9280122Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9281295Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9282506Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9283669Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9284872Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9286099Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9287262Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9288429Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9289590Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.9290755Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9291954Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9292211Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:22.9292341Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9292416Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9292490Z unimplemented [] 2025-12-04T10:01:22.9292595Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9292823Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9294045Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9294145Z graph_break [] 2025-12-04T10:01:22.9294273Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9294343Z Autotune Choices Stats: 2025-12-04T10:01:22.9295746Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9296000Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9296221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9296538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9297683Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9298804Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9299973Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9301105Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9302295Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9303455Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9303697Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:22.9303772Z Autotune Choices Stats: 2025-12-04T10:01:22.9305219Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9305667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9306005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9306568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9307792Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9309005Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9310171Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9311416Z 
triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9312612Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9313997Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9315189Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9316361Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9317577Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9318743Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9319000Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:22.9319167Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.9319255Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9319321Z unimplemented [] 2025-12-04T10:01:22.9319461Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9319685Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9320884Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9320953Z graph_break [] 2025-12-04T10:01:22.9321082Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9321150Z Autotune Choices Stats: 2025-12-04T10:01:22.9322574Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9322830Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9323053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9323369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9324515Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9325678Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9326810Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9327993Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9329181Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9330318Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9330565Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:22.9330643Z Autotune Choices Stats: 2025-12-04T10:01:22.9332093Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9332541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9332876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9333448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9334659Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9335835Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9337063Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9338264Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9339423Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9340595Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9341768Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9342982Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9344149Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9345353Z 
triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9345653Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:22.9345821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9345900Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9345964Z unimplemented [] 2025-12-04T10:01:22.9346066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9346258Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9349368Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9349477Z graph_break [] 2025-12-04T10:01:22.9349630Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9349703Z Autotune Choices Stats: 2025-12-04T10:01:22.9351144Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9351402Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9351651Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9351971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9353117Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9354253Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.9355631Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9356920Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9358093Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9359297Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9359562Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:22.9359632Z Autotune Choices Stats: 2025-12-04T10:01:22.9361112Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9361557Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9361897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9362460Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9363649Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9364858Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9366106Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9367323Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9368489Z 
triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9369654Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9370815Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9371981Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9373143Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9374375Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9374685Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:22.9374864Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:22.9374943Z Traceback (most recent call last): 2025-12-04T10:01:22.9375253Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:22.9375319Z self.assertTrue( 2025-12-04T10:01:22.9375525Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:22.9375613Z raise self.failureException(msg) 2025-12-04T10:01:22.9375900Z AssertionError: False is not true : Log 
file /tmp/tmpexqsmcb8/flex_attention_configs.json was not created 2025-12-04T10:01:22.9375906Z 2025-12-04T10:01:22.9376047Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:22.9376303Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:22.9376307Z 2025-12-04T10:01:22.9376470Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:22.9376613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9376686Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9376755Z unimplemented [] 2025-12-04T10:01:22.9376861Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9378086Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:22.9378286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9378348Z graph_break [] 2025-12-04T10:01:22.9378486Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9379498Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:22.9379584Z current_size = base.storage().size() 2025-12-04T10:01:22.9379658Z Autotune Choices Stats: 2025-12-04T10:01:22.9381071Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:22.9381369Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9381625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9381980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9383122Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9384288Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9385406Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9386529Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9387706Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9388843Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9389099Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:22.9389167Z Autotune Choices Stats: 2025-12-04T10:01:22.9390642Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:22.9391201Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9391532Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9392101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9393310Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9394474Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9395648Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9396808Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9397978Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9399169Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9400393Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9401597Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9402753Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9403919Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9404168Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:22.9404310Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9404381Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9404445Z unimplemented [] 2025-12-04T10:01:22.9404557Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9404745Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9405944Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9406007Z graph_break [] 2025-12-04T10:01:22.9406133Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9406204Z Autotune Choices Stats: 2025-12-04T10:01:22.9407642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9407987Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9408210Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9408530Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9409704Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9410832Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9411960Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9413092Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9414208Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9415338Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9415620Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:22.9415686Z Autotune Choices Stats: 2025-12-04T10:01:22.9417155Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9417629Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9417961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9418563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9419741Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9420911Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9422074Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9423238Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9424406Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9425631Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9426831Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9428076Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9429244Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9430418Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9430823Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:22.9431048Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9431154Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9431249Z unimplemented [] 2025-12-04T10:01:22.9431411Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9431711Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9433886Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9434117Z graph_break [] 2025-12-04T10:01:22.9434411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9434545Z Autotune Choices Stats: 2025-12-04T10:01:22.9437143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9437687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9438090Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9438670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9440807Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9442825Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9444835Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9446833Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9448840Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9451055Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9451663Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:22.9451797Z Autotune Choices Stats: 2025-12-04T10:01:22.9454359Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9455553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9456173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9457154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9459269Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9461301Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9463326Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9465425Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9467731Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9469740Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9471913Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9474076Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9475715Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9476927Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9477202Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:22.9477353Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9477428Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9477502Z unimplemented [] 2025-12-04T10:01:22.9477610Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9477801Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9479091Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9479232Z graph_break [] 2025-12-04T10:01:22.9479382Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9479452Z Autotune Choices Stats: 2025-12-04T10:01:22.9480877Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9481134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9481408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9481738Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9482897Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9484036Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9485169Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9486298Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9487427Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.9488639Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9488932Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:22.9489002Z Autotune Choices Stats: 2025-12-04T10:01:22.9490535Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9490985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9491323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9491894Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9493085Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9494263Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9495445Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9496659Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9497886Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9499096Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9500264Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:22.9501436Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9502599Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9503771Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9504025Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:22.9504161Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9504232Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9504301Z unimplemented [] 
2025-12-04T10:01:22.9504456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9504646Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9505871Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9505971Z graph_break [] 2025-12-04T10:01:22.9506109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9506178Z Autotune Choices Stats: 2025-12-04T10:01:22.9507728Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9507984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9508209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9508544Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9509695Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9510836Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9511971Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9513100Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9514298Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9515452Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9515717Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:22.9515786Z Autotune Choices Stats: 2025-12-04T10:01:22.9517286Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9517729Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9518073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9518634Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9519817Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9520986Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9522197Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9523393Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9524613Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9525820Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9526981Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9528146Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9529305Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9530475Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9530759Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:22.9530909Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9531018Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9531091Z unimplemented [] 2025-12-04T10:01:22.9531232Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9531422Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9532627Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9532692Z graph_break [] 2025-12-04T10:01:22.9532835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9532903Z Autotune Choices Stats: 2025-12-04T10:01:22.9534356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9534615Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9534841Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9535167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9536314Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9537453Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9538584Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9539751Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9540934Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9542094Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9542349Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:22.9542417Z Autotune Choices Stats: 2025-12-04T10:01:22.9543878Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9544321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9544663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9545226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9546419Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9547658Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9548951Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9550149Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9551363Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9552533Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9553703Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9554873Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9556270Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9557538Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9557900Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:22.9558045Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9558115Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9558198Z unimplemented [] 2025-12-04T10:01:22.9558308Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9558494Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9559750Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9559817Z graph_break [] 2025-12-04T10:01:22.9559952Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9560021Z Autotune Choices Stats: 2025-12-04T10:01:22.9561433Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9561687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9561910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9562229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9563368Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.9564513Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9565677Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9566855Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9568022Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9569195Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9569642Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:22.9569757Z Autotune Choices Stats: 2025-12-04T10:01:22.9572119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9572864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9573423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:22.9574392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9576472Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9578601Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9580489Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9581751Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9582939Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9584109Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9585278Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9586456Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9587697Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9588947Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9589239Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:22.9589393Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:22.9589468Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9589538Z unimplemented [] 2025-12-04T10:01:22.9589646Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9589840Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9591075Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9591148Z graph_break [] 2025-12-04T10:01:22.9591289Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9591361Z Autotune Choices Stats: 2025-12-04T10:01:22.9592797Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.9593050Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9593276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9593607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9594772Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9595925Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9597159Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9598322Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9599483Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9600617Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9600897Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:22.9600970Z Autotune Choices Stats: 2025-12-04T10:01:22.9602440Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9602885Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9603228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9603801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9605029Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9606262Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9607471Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9608642Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9609821Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9610991Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9612161Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9613329Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9614564Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9615769Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9616033Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:22.9616209Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9616283Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9616351Z unimplemented [] 2025-12-04T10:01:22.9616456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9616645Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9617844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9617908Z graph_break [] 2025-12-04T10:01:22.9618048Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9618120Z Autotune Choices Stats: 2025-12-04T10:01:22.9619557Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:22.9619811Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9620051Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9620383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9621530Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9622746Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9623903Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9625088Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9626226Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9627423Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9627683Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.9627750Z Autotune Choices Stats: 2025-12-04T10:01:22.9629216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9629661Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9629997Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9630564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9631822Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9633024Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9634226Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9635397Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9636571Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9637745Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9638919Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9640120Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9641345Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9642550Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9642807Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:22.9642949Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9643026Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9643106Z unimplemented [] 2025-12-04T10:01:22.9643214Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9643405Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9644612Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9644674Z graph_break [] 2025-12-04T10:01:22.9644809Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9644879Z Autotune Choices Stats: 2025-12-04T10:01:22.9646303Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9646557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9646782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9647109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9648289Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9649486Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9650647Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9651780Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9652915Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9654052Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9654309Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:22.9654376Z Autotune Choices Stats: 2025-12-04T10:01:22.9656552Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9657017Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9657502Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9658073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9659346Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9660575Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9661750Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9662917Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9664096Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9665274Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9666481Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9667758Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9668996Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9670169Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9670429Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:22.9670581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9670654Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9670720Z unimplemented [] 2025-12-04T10:01:22.9670838Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9671032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9672228Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9672292Z graph_break [] 2025-12-04T10:01:22.9672430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9672498Z Autotune Choices Stats: 2025-12-04T10:01:22.9673920Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:22.9674177Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9674411Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9674776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9675954Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9677115Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9678276Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9679404Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9680535Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9681666Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9681925Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:22.9681993Z Autotune Choices Stats: 2025-12-04T10:01:22.9683450Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9683957Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9684327Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9684886Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9686116Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9687280Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9688468Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9689629Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9690795Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9691966Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9693259Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9694453Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9695657Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:22.9696833Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9697083Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:22.9697237Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9697307Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9697369Z unimplemented [] 2025-12-04T10:01:22.9697478Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9697667Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9698865Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9698931Z graph_break [] 2025-12-04T10:01:22.9699068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9699136Z Autotune Choices Stats: 2025-12-04T10:01:22.9700543Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9700866Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9701093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9701443Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9702580Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9703743Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9704873Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9706014Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9707141Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9708322Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9708582Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:22.9708650Z Autotune Choices Stats: 2025-12-04T10:01:22.9710137Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9710650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9710983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9711552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9712769Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9714058Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9716009Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9718035Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.9719981Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9721485Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9723205Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9724528Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9726075Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9727718Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9727984Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:22.9728137Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9728212Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9728306Z unimplemented [] 2025-12-04T10:01:22.9728485Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9728799Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9730246Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9730337Z graph_break [] 2025-12-04T10:01:22.9730565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9730680Z Autotune Choices Stats: 2025-12-04T10:01:22.9732787Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:22.9733195Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9733427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9733750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9735411Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9737081Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9738357Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9739827Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9741410Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9742932Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9743329Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:22.9743500Z Autotune Choices Stats: 2025-12-04T10:01:22.9745644Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9746142Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:22.9746510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9747090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9748390Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9749585Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9750765Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9751936Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9753152Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9754376Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9755854Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9757055Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9758234Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9759402Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:22.9759655Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:22.9759802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9759874Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9759941Z unimplemented [] 2025-12-04T10:01:22.9760059Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9760248Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9761452Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9761574Z graph_break [] 2025-12-04T10:01:22.9761705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9761831Z Autotune Choices Stats: 2025-12-04T10:01:22.9763256Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9763556Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9763782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9764161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9765318Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9766457Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9767584Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9768716Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9769848Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9771050Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9771342Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:22.9771412Z Autotune Choices Stats: 2025-12-04T10:01:22.9772861Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9773342Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9773677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9774244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9775437Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9776620Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9777785Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9778950Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9780209Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.9782316Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9784368Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9786191Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9787677Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9788858Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9789115Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:22.9789264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9789337Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9789401Z unimplemented [] 2025-12-04T10:01:22.9789511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9789707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9791013Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9791110Z graph_break [] 2025-12-04T10:01:22.9791246Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:22.9791323Z Autotune Choices Stats: 2025-12-04T10:01:22.9792754Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9793053Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9793283Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9793608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9794766Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9795911Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9797035Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9798175Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9799331Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9800555Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9800816Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:22.9800889Z Autotune Choices Stats: 2025-12-04T10:01:22.9802373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9802825Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9803162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9803737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:22.9804920Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9806112Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9807282Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9808531Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9809751Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9810968Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9812136Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9813311Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9814472Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9815650Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9815900Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:22.9816043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9816153Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9816218Z unimplemented [] 2025-12-04T10:01:22.9816338Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9816564Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9817796Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9817857Z graph_break [] 2025-12-04T10:01:22.9817989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9818063Z Autotune Choices Stats: 2025-12-04T10:01:22.9819509Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9819766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9819992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9820311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9821458Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9822592Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9823719Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9824852Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:22.9826057Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9827297Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9827595Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:22.9827671Z Autotune Choices Stats: 2025-12-04T10:01:22.9829125Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9829572Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9829907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9830470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9831664Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9832869Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9834084Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9835321Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9836555Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9837721Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.9838891Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9840057Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9841224Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9842391Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9842711Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:22.9842856Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9842964Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9843026Z unimplemented [] 2025-12-04T10:01:22.9843136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9843324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9844535Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9844602Z graph_break [] 2025-12-04T10:01:22.9844765Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9844843Z Autotune Choices Stats: 2025-12-04T10:01:22.9846253Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9846509Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9846737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9847064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9848209Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9849367Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9855532Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9856945Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9858132Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9859348Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9859610Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:22.9859691Z Autotune Choices Stats: 
2025-12-04T10:01:22.9861145Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9861601Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9861938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9862499Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9863703Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:22.9864917Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9866129Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9867402Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9868609Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9869780Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9870941Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9872107Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9873271Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9874525Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9874814Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:22.9874961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9875036Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9875102Z unimplemented [] 2025-12-04T10:01:22.9875215Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9875412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9876663Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9876735Z graph_break [] 2025-12-04T10:01:22.9876869Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9876943Z Autotune Choices Stats: 2025-12-04T10:01:22.9878360Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9878618Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9878842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9879164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9880304Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9881434Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9882629Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9883792Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9884950Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:22.9886090Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9886339Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:22.9886413Z Autotune Choices Stats: 2025-12-04T10:01:22.9887849Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9888298Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9888633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9889196Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9890368Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9891602Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9892795Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9893989Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9895163Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9896337Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9897498Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:22.9898663Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9899863Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9901085Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9901333Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:22.9901478Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9901550Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9901615Z 
unimplemented [] 2025-12-04T10:01:22.9901779Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9901970Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9903165Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9903235Z graph_break [] 2025-12-04T10:01:22.9903367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9903443Z Autotune Choices Stats: 2025-12-04T10:01:22.9904848Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9905102Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9905324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9905646Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9906786Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9908019Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9909210Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9910326Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9911479Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9912610Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9912860Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:22.9912933Z Autotune Choices Stats: 2025-12-04T10:01:22.9914370Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9914825Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9915193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9915857Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9917247Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9918447Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:22.9919634Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9920801Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9921963Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9923125Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9924297Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9925529Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9926940Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9928135Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9928426Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:22.9928567Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9928645Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9928710Z unimplemented [] 2025-12-04T10:01:22.9928825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9929018Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9930222Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9930299Z graph_break [] 2025-12-04T10:01:22.9930435Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9930509Z Autotune Choices Stats: 2025-12-04T10:01:22.9931910Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9932169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9932394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9932707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9933853Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9935065Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9936243Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9937402Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9938526Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9939662Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9939908Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:22.9939983Z Autotune Choices Stats: 2025-12-04T10:01:22.9941427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:22.9941872Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9942204Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9942832Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9944033Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9945237Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9946391Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9947614Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9948781Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9949936Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9951095Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9952317Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:22.9953510Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9954730Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9954978Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:22.9955109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:22.9955377Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9955481Z unimplemented [] 2025-12-04T10:01:22.9955595Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9955784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9956984Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9957058Z graph_break [] 2025-12-04T10:01:22.9957189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9957260Z Autotune Choices Stats: 2025-12-04T10:01:22.9958661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9958915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9959137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9959444Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9960720Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9961888Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9963052Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9964180Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9965307Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9966434Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9966683Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:22.9966756Z Autotune Choices Stats: 2025-12-04T10:01:22.9968193Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:22.9968678Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9969067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9969660Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9970831Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9972032Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9973197Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9974368Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9975528Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:22.9976699Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9977898Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9979163Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:22.9980368Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9981528Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:22.9981778Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:22.9981914Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:22.9981994Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:22.9982060Z unimplemented [] 2025-12-04T10:01:22.9982170Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:22.9982358Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:22.9983561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:22.9983635Z graph_break [] 2025-12-04T10:01:22.9983768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:22.9983842Z Autotune Choices Stats: 2025-12-04T10:01:22.9985251Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:22.9985544Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9985804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9986117Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9987335Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9988504Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9989632Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9990764Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:22.9991890Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:22.9993026Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:22.9993277Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:22.9993350Z Autotune Choices Stats: 2025-12-04T10:01:22.9994819Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:22.9995384Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:22.9995782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:22.9996453Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:22.9997829Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:22.9998997Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0000160Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0001324Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0002485Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0003689Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0004905Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0006318Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0007564Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0008723Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0008972Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.0009105Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0009180Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0009243Z unimplemented [] 2025-12-04T10:01:23.0009346Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0009536Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0010735Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0010805Z graph_break [] 2025-12-04T10:01:23.0010933Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0011007Z Autotune Choices Stats: 2025-12-04T10:01:23.0012443Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0012760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0012981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0013292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0014443Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0015606Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.0016734Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0017872Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0018999Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0020133Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0020380Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.0020488Z Autotune Choices Stats: 2025-12-04T10:01:23.0021956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.0022431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0022764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0023358Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0024530Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0025713Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0026870Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0028080Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0029243Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0030468Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0031663Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0032870Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0034037Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0035192Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0035458Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.0035592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0035669Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0035734Z unimplemented [] 2025-12-04T10:01:23.0035835Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0036032Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0037226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0037295Z graph_break [] 2025-12-04T10:01:23.0037427Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0037536Z Autotune Choices Stats: 2025-12-04T10:01:23.0038970Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0039255Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0039476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0039786Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0040964Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0042083Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0043214Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0044341Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0045479Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0046656Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0046936Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.0047046Z Autotune Choices Stats: 2025-12-04T10:01:23.0048473Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0048922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0049289Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0050020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0052022Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0054047Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0056256Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0058225Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0060337Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0062097Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0063345Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0064542Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0065731Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0066896Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0067166Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.0067402Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0067483Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0067549Z unimplemented [] 2025-12-04T10:01:23.0067657Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0067858Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0069101Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0069206Z graph_break [] 2025-12-04T10:01:23.0069345Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0069473Z Autotune Choices Stats: 2025-12-04T10:01:23.0070900Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0071158Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0071419Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0071735Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0072897Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0074031Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0075158Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0076291Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0077403Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0078605Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0078904Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.0078981Z Autotune Choices Stats: 2025-12-04T10:01:23.0080464Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0080915Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0081249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0081813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0082995Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0084172Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0085334Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0086535Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0087756Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0088956Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0090124Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0091287Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0092455Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0093617Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0093877Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.0094017Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0094096Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0094161Z unimplemented [] 2025-12-04T10:01:23.0094272Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0094506Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0095979Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0096086Z graph_break [] 2025-12-04T10:01:23.0096218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0096287Z Autotune Choices Stats: 2025-12-04T10:01:23.0097744Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0097997Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0098226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0098542Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0099718Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0100841Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0101984Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0103115Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0104321Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0105492Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0105741Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.0105821Z Autotune Choices Stats: 2025-12-04T10:01:23.0107368Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.0107824Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0108163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0108725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0109907Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0111087Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0112252Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0113476Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0114662Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0115869Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0117036Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0118195Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0119356Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0120515Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0120807Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.0120986Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.0121148Z Traceback (most recent call last): 2025-12-04T10:01:23.0121451Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.0121553Z self.assertTrue( 2025-12-04T10:01:23.0121760Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 
2025-12-04T10:01:23.0121843Z raise self.failureException(msg) 2025-12-04T10:01:23.0122092Z AssertionError: False is not true : Log file /tmp/tmpbvb4g54s/flex_attention_configs.json was not created 2025-12-04T10:01:23.0122100Z 2025-12-04T10:01:23.0122252Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.0122509Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.0122515Z 2025-12-04T10:01:23.0122686Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.0122822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0122902Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0123008Z unimplemented [] 2025-12-04T10:01:23.0123118Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0124330Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.0124519Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0124585Z graph_break [] 2025-12-04T10:01:23.0124718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0125725Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. 
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.0125818Z current_size = base.storage().size() 2025-12-04T10:01:23.0125887Z Autotune Choices Stats: 2025-12-04T10:01:23.0127302Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.0127557Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0127799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0128111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0129286Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0130464Z triton_flex_attention_5 0.0102 ms 
91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0131623Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0132743Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0133861Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.0134982Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0135235Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.0135304Z Autotune Choices Stats: 2025-12-04T10:01:23.0136739Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.0137181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0137573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0138157Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0139367Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0140567Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0141732Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0142905Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0144062Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0145224Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0146380Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:23.0147660Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0148854Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0150041Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0150296Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.0150432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0150507Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0150578Z unimplemented [] 
2025-12-04T10:01:23.0150682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0150875Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0152081Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0152144Z graph_break [] 2025-12-04T10:01:23.0152281Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0152354Z Autotune Choices Stats: 2025-12-04T10:01:23.0153764Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0154012Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0154239Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0154593Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0156016Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0157196Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0158386Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0159509Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0160638Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0161754Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0162011Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.0162082Z Autotune Choices Stats: 2025-12-04T10:01:23.0163526Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0164054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0164428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0164990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0166452Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0167675Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0169179Z 
triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0171114Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0173098Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0175152Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0177322Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0179357Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0180600Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0182253Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0182525Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.0182676Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0182754Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0182828Z unimplemented [] 2025-12-04T10:01:23.0182938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0183139Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0184802Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0184881Z graph_break [] 2025-12-04T10:01:23.0185035Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0185152Z Autotune Choices Stats: 2025-12-04T10:01:23.0187031Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0187596Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0188031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0188399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0189978Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0191269Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0192716Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0194299Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0195746Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0196981Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0197406Z SingleProcess AUTOTUNE benchmarking takes 0.2908 
seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.0197518Z Autotune Choices Stats: 2025-12-04T10:01:23.0199383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0200138Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0200479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0201373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0202715Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0204661Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0206611Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0208231Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0209382Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0210586Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0211816Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0213016Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0214180Z triton_flex_attention_backward_56 0.0142 ms 
79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0215339Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0215599Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.0215740Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0215814Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0215889Z unimplemented [] 2025-12-04T10:01:23.0215996Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0216196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0217394Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0217459Z graph_break [] 2025-12-04T10:01:23.0217599Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0217670Z Autotune Choices Stats: 2025-12-04T10:01:23.0219110Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0219428Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0219661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0219973Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0221149Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0222274Z 
triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0223412Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0224527Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0225653Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0226770Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0227063Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.0227137Z Autotune Choices Stats: 2025-12-04T10:01:23.0228706Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0229182Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0229521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0230114Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0231283Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0232449Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0233609Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0234770Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0235957Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0237147Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0238335Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0239546Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0240705Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0241861Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0242116Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.0242254Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0242326Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:23.0242399Z unimplemented [] 2025-12-04T10:01:23.0242506Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0242700Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0243894Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0243995Z graph_break [] 2025-12-04T10:01:23.0244132Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0244202Z Autotune Choices Stats: 2025-12-04T10:01:23.0245668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0245956Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0246188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0246515Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0247686Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0248808Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0249956Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0251085Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0252202Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0253348Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0253668Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.0253739Z Autotune Choices Stats: 2025-12-04T10:01:23.0255679Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0256649Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0257084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0257723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0258922Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0260088Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:23.0261247Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0262402Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0263651Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0265476Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0269879Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0274146Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0278390Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0282559Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0285196Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.0286043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0286541Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0286887Z unimplemented [] 2025-12-04T10:01:23.0287244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0287882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0290283Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0291692Z graph_break [] 2025-12-04T10:01:23.0291932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0292231Z Autotune Choices Stats: 2025-12-04T10:01:23.0293930Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0295641Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0296246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0296878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0298412Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0300727Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0303025Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0305322Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0307698Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0310094Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0311559Z 
SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.0311949Z Autotune Choices Stats: 2025-12-04T10:01:23.0313538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0315489Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0316331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0317296Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0319109Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0321493Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0323873Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0326301Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0328739Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0331152Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0333530Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0335906Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.0338277Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0340661Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0342132Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.0342594Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0342893Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0343128Z unimplemented [] 2025-12-04T10:01:23.0343334Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0343698Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0345215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0346812Z graph_break [] 2025-12-04T10:01:23.0347080Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0347480Z Autotune Choices Stats: 2025-12-04T10:01:23.0349068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0350773Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0351308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0351917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0353439Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0355977Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0358288Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0360586Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0363029Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0365394Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0366834Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.0367225Z Autotune Choices Stats: 2025-12-04T10:01:23.0368851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.0370792Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0371623Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0372579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0374382Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0377006Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0379417Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0381832Z 
triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0384267Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0386649Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0389110Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0391485Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0393857Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0396229Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0397744Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.0398244Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.0398531Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0398767Z unimplemented [] 2025-12-04T10:01:23.0398968Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0399349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0400800Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0402116Z graph_break [] 2025-12-04T10:01:23.0402335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0402615Z Autotune Choices Stats: 2025-12-04T10:01:23.0404172Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.0405887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0406425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0407027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0408543Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0410851Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0413149Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0415514Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0417833Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0420162Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0421595Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.0421982Z Autotune Choices Stats: 2025-12-04T10:01:23.0423548Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0425486Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0426320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0427332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0429126Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0431506Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0433946Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0436362Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0438767Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0441144Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0443510Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0445893Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0448248Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0450658Z 
triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0452189Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.0452645Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0452919Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0453112Z unimplemented [] 2025-12-04T10:01:23.0453313Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0453662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0455169Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0456710Z graph_break [] 2025-12-04T10:01:23.0456941Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0457215Z Autotune Choices Stats: 2025-12-04T10:01:23.0458733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.0460439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0460981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0461596Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0463124Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0465433Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.0467862Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0470266Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0472621Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0474941Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0476376Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.0476757Z Autotune Choices Stats: 2025-12-04T10:01:23.0478305Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0480235Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0481070Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0482418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0485590Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0489188Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0491691Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0494207Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0496634Z 
triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0499012Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0501401Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0503782Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0506209Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0508704Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0510238Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.0510702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0510995Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0511187Z unimplemented [] 2025-12-04T10:01:23.0511390Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0511760Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0513248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0514560Z graph_break [] 2025-12-04T10:01:23.0514790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0515078Z Autotune Choices Stats: 2025-12-04T10:01:23.0516598Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0518306Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0518848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0519458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0520993Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0523333Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0525713Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0528076Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0530424Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0532728Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0534165Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.0534557Z Autotune Choices Stats: 2025-12-04T10:01:23.0536103Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0538156Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0538988Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0539946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0541789Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0544243Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0546646Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0549120Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0551492Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0553866Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0556477Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0558860Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0561372Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0563789Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0565259Z SingleProcess AUTOTUNE 
benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.0565818Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0566122Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0566322Z unimplemented [] 2025-12-04T10:01:23.0566519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0566883Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0568378Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0569693Z graph_break [] 2025-12-04T10:01:23.0569914Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0570197Z Autotune Choices Stats: 2025-12-04T10:01:23.0571719Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.0573419Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0573965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0574570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0576097Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0578493Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0580811Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0583145Z 
triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0585455Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0587823Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0589269Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.0589657Z Autotune Choices Stats: 2025-12-04T10:01:23.0591214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0593154Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0594032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0595046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0596887Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0599307Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0601731Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0604116Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0606490Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0608869Z triton_flex_attention_backward_200 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0611242Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0613703Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0616108Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0618512Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0619979Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.0620437Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0620711Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0620908Z unimplemented [] 2025-12-04T10:01:23.0621112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0621470Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0622916Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0624222Z graph_break [] 2025-12-04T10:01:23.0624446Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0624717Z Autotune Choices Stats: 2025-12-04T10:01:23.0626248Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0628031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0628568Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0629176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0630755Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0633126Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0635488Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0637813Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0640123Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0642422Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0643856Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.0644242Z Autotune Choices Stats: 2025-12-04T10:01:23.0645794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0647731Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0648642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0649630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0651435Z triton_flex_attention_backward_215 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0653867Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0656694Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0659124Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0661515Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0663889Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0666351Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0668898Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0671330Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0673708Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0675184Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.0675647Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0675934Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0676124Z unimplemented [] 2025-12-04T10:01:23.0676344Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0676715Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0678168Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0679476Z graph_break [] 2025-12-04T10:01:23.0679701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0679980Z Autotune Choices Stats: 2025-12-04T10:01:23.0681503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.0683226Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0683804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0684445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0685967Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0688310Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0690644Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0692944Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0695247Z triton_flex_attention_231 
0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0697550Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0698986Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.0699377Z Autotune Choices Stats: 2025-12-04T10:01:23.0700933Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.0702941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0703828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0704785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0706625Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0709088Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0711477Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0713852Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0716245Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0718619Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0721059Z triton_flex_attention_backward_238 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0723458Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0725864Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0728228Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0729694Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.0730147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0730445Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0730646Z unimplemented [] 2025-12-04T10:01:23.0730847Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0731210Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0732657Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0733972Z graph_break [] 2025-12-04T10:01:23.0734195Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0734468Z Autotune Choices Stats: 2025-12-04T10:01:23.0735978Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0737756Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0738331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0738933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0740493Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0742836Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0745165Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0747558Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0749898Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0752199Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0753623Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.0754011Z Autotune Choices Stats: 2025-12-04T10:01:23.0755952Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0757965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0758808Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0759812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0761624Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0764011Z triton_flex_attention_backward_255 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0766416Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0768797Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0771159Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0773635Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0776049Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0778504Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0780879Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0783300Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0784779Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.0785243Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0785528Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0785732Z unimplemented [] 2025-12-04T10:01:23.0785940Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0786302Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0787841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0789159Z graph_break [] 2025-12-04T10:01:23.0789389Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0789670Z Autotune Choices Stats: 2025-12-04T10:01:23.0791288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0793032Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0793574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0794186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0795748Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0798054Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0800347Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0802648Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0804950Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0807255Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0808755Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.0809145Z Autotune Choices Stats: 2025-12-04T10:01:23.0810719Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.0812690Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0813599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0814564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0816365Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0818749Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0821116Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0823487Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0825901Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0828397Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0830800Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0833163Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0835530Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0837897Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0839358Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.0839816Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0840101Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0840290Z unimplemented [] 2025-12-04T10:01:23.0840494Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0840860Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0842312Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0843689Z graph_break [] 2025-12-04T10:01:23.0843942Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0844220Z Autotune Choices Stats: 2025-12-04T10:01:23.0845805Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0847541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0848080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0848723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0855953Z 
triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0858445Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0860749Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0863048Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0865339Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0867893Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0869393Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.0869800Z Autotune Choices Stats: 2025-12-04T10:01:23.0871445Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0873415Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0874252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0875207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0877019Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0879387Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0881754Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0884113Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0886914Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0889357Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.0891754Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0894114Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0896504Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0898874Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0900340Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.0900801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0901091Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0901288Z unimplemented [] 2025-12-04T10:01:23.0901494Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0901867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0903414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0904762Z graph_break [] 2025-12-04T10:01:23.0904989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0905273Z Autotune Choices Stats: 2025-12-04T10:01:23.0906846Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0908689Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0909233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0909839Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0911357Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0913657Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0915958Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0918244Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0920570Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0922923Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0924347Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.0924740Z Autotune Choices Stats: 
2025-12-04T10:01:23.0926324Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0928310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0929157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0930109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0931912Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.0934285Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0936659Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0939106Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0941510Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0943901Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0946272Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0948724Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0951081Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0953435Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.0954895Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.0955711Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.0955993Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.0956194Z unimplemented [] 2025-12-04T10:01:23.0956477Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.0956844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.0958344Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.0959648Z graph_break [] 2025-12-04T10:01:23.0959877Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.0960159Z Autotune Choices Stats: 2025-12-04T10:01:23.0961733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.0963446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0963984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0964593Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0966153Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0968462Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0970751Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.0973109Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.0975449Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.0977791Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0979255Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.0979648Z Autotune Choices Stats: 2025-12-04T10:01:23.0981186Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.0983118Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.0983964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.0984908Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.0986701Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.0989229Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0991698Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.0994121Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.0996539Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.0998925Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1001304Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.1003678Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1006062Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1008447Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1009990Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.1010491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1010782Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1010974Z 
unimplemented [] 2025-12-04T10:01:23.1011180Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1011549Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1013005Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1014328Z graph_break [] 2025-12-04T10:01:23.1014590Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1014869Z Autotune Choices Stats: 2025-12-04T10:01:23.1016392Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1018100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1018680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1019290Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1020827Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1023134Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1025419Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1027882Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1030215Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1032556Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1033983Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.1034386Z Autotune Choices Stats: 2025-12-04T10:01:23.1035935Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1037867Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1038706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1039661Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1041463Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1043887Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.1046360Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1047554Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1048716Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1049875Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1051032Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1052194Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1053360Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1054595Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1054877Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.1055013Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1055092Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1055155Z unimplemented [] 2025-12-04T10:01:23.1055512Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1055712Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1056979Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1057055Z graph_break [] 2025-12-04T10:01:23.1057189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1057265Z Autotune Choices Stats: 2025-12-04T10:01:23.1058666Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1058924Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1059144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1059457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1060594Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1061707Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1062945Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1064110Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1065258Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1066377Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1066626Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.1066700Z Autotune Choices Stats: 2025-12-04T10:01:23.1068219Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1068669Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1069006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1069572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1070735Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1071972Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1073156Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1074386Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1075547Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1076729Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1077898Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1079063Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.1080273Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1081496Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1081748Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.1081883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1081960Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1082025Z unimplemented [] 2025-12-04T10:01:23.1082167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1082362Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1083546Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1083617Z graph_break [] 2025-12-04T10:01:23.1083756Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1083829Z Autotune Choices Stats: 2025-12-04T10:01:23.1085221Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1085476Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1085698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1086011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1087145Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1088293Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1089521Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1090677Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1091789Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1092911Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1093159Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.1093232Z Autotune Choices Stats: 2025-12-04T10:01:23.1094666Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1095112Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1095439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1096001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1097239Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1098437Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1099624Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1100824Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1101984Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1103135Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1104326Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1105510Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1106697Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1107986Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1108251Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.1108393Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.1108474Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1108539Z unimplemented [] 2025-12-04T10:01:23.1108644Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1108837Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1110027Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1110101Z graph_break [] 2025-12-04T10:01:23.1110229Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1110305Z Autotune Choices Stats: 2025-12-04T10:01:23.1111710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1111967Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1112190Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1112503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1113639Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1114823Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1115981Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1117147Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1118264Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1119388Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1119637Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.1119712Z Autotune Choices Stats: 2025-12-04T10:01:23.1121136Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.1121583Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1121921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1122548Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1123748Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1124936Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1126099Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1127257Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1128400Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1129558Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1130714Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1131933Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1133131Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1134315Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1134569Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.1134736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1134816Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1134880Z unimplemented [] 2025-12-04T10:01:23.1134982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1135177Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1136377Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1136447Z graph_break [] 2025-12-04T10:01:23.1136574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1136645Z Autotune Choices Stats: 2025-12-04T10:01:23.1138052Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1138297Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1138519Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1138869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1140036Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1141192Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.1142351Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1143467Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1144580Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1145701Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1145957Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.1146034Z Autotune Choices Stats: 2025-12-04T10:01:23.1147518Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1147996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1148361Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1148954Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1150116Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1151313Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1152465Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1153629Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1154777Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1156164Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1157453Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1158651Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1159856Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1161009Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1161271Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.1161407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1161489Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1161553Z unimplemented [] 2025-12-04T10:01:23.1161658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1161853Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1163035Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1163104Z graph_break [] 2025-12-04T10:01:23.1163235Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1163308Z Autotune Choices Stats: 2025-12-04T10:01:23.1164710Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1164996Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1165265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1165614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1166747Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1167894Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1169016Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1170145Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1171256Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1172376Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1172621Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.1172694Z Autotune Choices Stats: 2025-12-04T10:01:23.1174165Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1174695Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1175028Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1175588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1176790Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1177948Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1179102Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1180271Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1181423Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1182614Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1183835Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1185018Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1186172Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1187390Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1187647Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.1187777Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1187862Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1187931Z unimplemented [] 2025-12-04T10:01:23.1188036Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1188226Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1189418Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1189492Z graph_break [] 2025-12-04T10:01:23.1189624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1189694Z Autotune Choices Stats: 2025-12-04T10:01:23.1191155Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1191466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1191693Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1192005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1193166Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1194282Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1195409Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1196528Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1197653Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1198774Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1199054Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.1199129Z Autotune Choices Stats: 2025-12-04T10:01:23.1200593Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1201068Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1201401Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1201990Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1203163Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1204326Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1205491Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1206651Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1207799Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1209181Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1210374Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1211557Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1212712Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1213867Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1214121Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.1214257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1214334Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1214397Z unimplemented [] 2025-12-04T10:01:23.1214503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1214711Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1215896Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1215963Z graph_break [] 2025-12-04T10:01:23.1216134Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1216203Z Autotune Choices Stats: 2025-12-04T10:01:23.1217637Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1217918Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1218140Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1218452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1219610Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1220721Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1221843Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1223002Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1224142Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1225340Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1225653Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.1225721Z Autotune Choices Stats: 2025-12-04T10:01:23.1227154Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.1227694Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1228035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1228592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1229760Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1230924Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1232079Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1233237Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1234450Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1235643Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1236853Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1238002Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1239151Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1240300Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1240568Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.1240703Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1240781Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1240845Z unimplemented [] 2025-12-04T10:01:23.1240950Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1241143Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1242362Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1242466Z graph_break [] 2025-12-04T10:01:23.1242633Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1242701Z Autotune Choices Stats: 2025-12-04T10:01:23.1244099Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.1244349Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1244605Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1244919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1246055Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1247162Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1248281Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1249389Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1250503Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1251686Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1251963Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.1252029Z Autotune Choices Stats: 2025-12-04T10:01:23.1253510Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.1253957Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1254283Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1254840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1256359Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1257538Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1258705Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1259945Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1261190Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1262395Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1263556Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1264712Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1265868Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1267067Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1267407Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.1267584Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.1267672Z Traceback (most recent call last): 2025-12-04T10:01:23.1267979Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.1268090Z self.assertTrue( 2025-12-04T10:01:23.1268297Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.1268413Z raise self.failureException(msg) 2025-12-04T10:01:23.1268658Z AssertionError: False is not true : Log file /tmp/tmpq9_kmn9n/flex_attention_configs.json was not created 2025-12-04T10:01:23.1268723Z 2025-12-04T10:01:23.1268880Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.1269142Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.1269147Z 2025-12-04T10:01:23.1269322Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.1269459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1269533Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1269605Z unimplemented [] 2025-12-04T10:01:23.1269712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1270970Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.1271164Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1271225Z graph_break [] 2025-12-04T10:01:23.1271361Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1272361Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.1272452Z current_size = base.storage().size() 2025-12-04T10:01:23.1272532Z Autotune Choices Stats: 2025-12-04T10:01:23.1273943Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.1274196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1274423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1274743Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1275883Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1277069Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1278214Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1279349Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1280463Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1281577Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1281833Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.1281902Z Autotune Choices Stats: 2025-12-04T10:01:23.1283339Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.1283781Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1284119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1284676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1285933Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1287122Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1288307Z 
triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1289456Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1290615Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1291766Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1292916Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1294102Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1295311Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1296507Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1296763Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.1296903Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1296974Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1297043Z unimplemented [] 2025-12-04T10:01:23.1297148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1297382Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1298587Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1298653Z graph_break [] 2025-12-04T10:01:23.1298791Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1298861Z Autotune Choices Stats: 2025-12-04T10:01:23.1300284Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1300535Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1300763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1301088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1302266Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1303482Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1304642Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1305759Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1306886Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1308052Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1308312Z SingleProcess AUTOTUNE benchmarking takes 0.2912 
seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.1308383Z Autotune Choices Stats: 2025-12-04T10:01:23.1309832Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1310277Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1310691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1311259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1312468Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1313659Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1314824Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1315981Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1317140Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1318301Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1319488Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1320674Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1321863Z triton_flex_attention_backward_33 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1323058Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1323317Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.1323465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1323537Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1323605Z unimplemented [] 2025-12-04T10:01:23.1323711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1323902Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1325100Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1325163Z graph_break [] 2025-12-04T10:01:23.1325298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1325365Z Autotune Choices Stats: 2025-12-04T10:01:23.1326782Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1327036Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1327259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1327624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1328804Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1329966Z 
triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1331124Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1332257Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1333386Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1334507Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1334759Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.1334827Z Autotune Choices Stats: 2025-12-04T10:01:23.1336276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1336804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1337180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1337744Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1338961Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1340175Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1341342Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1342500Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1343664Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1344826Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1346055Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1347309Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1348502Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1349663Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1349914Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.1350062Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1350134Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:23.1350204Z unimplemented [] 2025-12-04T10:01:23.1350311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1350497Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1351680Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1351744Z graph_break [] 2025-12-04T10:01:23.1351880Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1351948Z Autotune Choices Stats: 2025-12-04T10:01:23.1353349Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1353638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1353894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1354250Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1355529Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1356768Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1357915Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1359049Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1360178Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1361315Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1361568Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.1361637Z Autotune Choices Stats: 2025-12-04T10:01:23.1363141Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1363666Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1364002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1364567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1365801Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1366974Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:23.1368143Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1369351Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1370516Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1371721Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1372942Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1374140Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1375296Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1376457Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1376708Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.1376851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1376929Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1376995Z unimplemented [] 2025-12-04T10:01:23.1377109Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1377298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1378498Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1378563Z graph_break [] 2025-12-04T10:01:23.1378698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1378767Z Autotune Choices Stats: 2025-12-04T10:01:23.1380204Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1380523Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1380746Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1381064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1382237Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1383358Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1384483Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1385611Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1386740Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1387952Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1388253Z 
SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.1388322Z Autotune Choices Stats: 2025-12-04T10:01:23.1389799Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1390277Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1390648Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1391207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1392385Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1393541Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1394739Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1395900Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1397097Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1398280Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1399533Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1400692Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.1401847Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1403008Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1403253Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.1403387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1403458Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1403522Z unimplemented [] 2025-12-04T10:01:23.1403632Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1403819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1405010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1405109Z graph_break [] 2025-12-04T10:01:23.1405246Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1405314Z Autotune Choices Stats: 2025-12-04T10:01:23.1406750Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1407035Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1407256Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1407573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1408735Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1409859Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1410976Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1412097Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1413220Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1414367Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1414968Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.1415037Z Autotune Choices Stats: 2025-12-04T10:01:23.1416479Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1416959Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1417296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1417861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1419039Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1420211Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1421381Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1422545Z 
triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1423780Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1424970Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1426168Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1427410Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1428568Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1429769Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1430020Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.1430157Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.1430227Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1430290Z unimplemented [] 2025-12-04T10:01:23.1430401Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1430587Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1431821Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1431971Z graph_break [] 2025-12-04T10:01:23.1432110Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1432189Z Autotune Choices Stats: 2025-12-04T10:01:23.1433596Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1433884Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1434108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1434430Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1435572Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1436698Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1437811Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1438940Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1440097Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1441256Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1441538Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.1441609Z Autotune Choices Stats: 2025-12-04T10:01:23.1443076Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1443523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1443858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1444437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1445617Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1446794Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1447970Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1449202Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1450400Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1451588Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1452750Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1453912Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1455070Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1456419Z 
triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1456667Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.1456805Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1456942Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1457004Z unimplemented [] 2025-12-04T10:01:23.1457112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1457299Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1458553Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1458661Z graph_break [] 2025-12-04T10:01:23.1458791Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1458867Z Autotune Choices Stats: 2025-12-04T10:01:23.1460314Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.1460569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1460791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1461107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1462245Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1463373Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.1464500Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1465662Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1466873Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1468101Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1468352Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.1468420Z Autotune Choices Stats: 2025-12-04T10:01:23.1469898Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1470342Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1470676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1471236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1472414Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1473588Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1474790Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1476012Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1477240Z 
triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1478409Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1479579Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1480741Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1481905Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1483070Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1483346Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.1483513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1483621Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1483684Z unimplemented [] 2025-12-04T10:01:23.1483794Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1483980Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1485171Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1485235Z graph_break [] 2025-12-04T10:01:23.1485365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1485440Z Autotune Choices Stats: 2025-12-04T10:01:23.1486881Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.1487134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1487356Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1487673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1488804Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1489942Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1491060Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1492256Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1493474Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1494660Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1494911Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.1494981Z Autotune Choices Stats: 2025-12-04T10:01:23.1496422Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1496864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1497196Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1497759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1498930Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1500105Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1501338Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1502529Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1503727Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1504890Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1506056Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1507294Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1508464Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1509664Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1509976Z SingleProcess AUTOTUNE 
benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.1510114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1510185Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1510247Z unimplemented [] 2025-12-04T10:01:23.1510359Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1510544Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1511781Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1511848Z graph_break [] 2025-12-04T10:01:23.1511976Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1512050Z Autotune Choices Stats: 2025-12-04T10:01:23.1513459Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1513712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1513929Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1514248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1515386Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1516509Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1517671Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1518866Z 
triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1520015Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1521148Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1521392Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.1521465Z Autotune Choices Stats: 2025-12-04T10:01:23.1522915Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1523356Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1523686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1524259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1525432Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1526672Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1527886Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1529075Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1530240Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1531408Z triton_flex_attention_backward_182 0.0133 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1532614Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1533781Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1534973Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1536215Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1536460Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.1536597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1536666Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1536729Z unimplemented [] 2025-12-04T10:01:23.1536838Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1537060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1538259Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1538321Z graph_break [] 2025-12-04T10:01:23.1538454Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1538529Z Autotune Choices Stats: 2025-12-04T10:01:23.1539931Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.1540181Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1540397Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1540714Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1541849Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1542972Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1544200Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1545356Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1546521Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1547734Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1547982Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.1548060Z Autotune Choices Stats: 2025-12-04T10:01:23.1549499Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1549957Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1550324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1550889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1552108Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1553337Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1554547Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1555859Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1557028Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1558192Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1559359Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1560515Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1561812Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1563037Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1563333Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.1563476Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1563548Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1563612Z unimplemented [] 2025-12-04T10:01:23.1563721Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1563907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1565096Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1565165Z graph_break [] 2025-12-04T10:01:23.1565298Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1565377Z Autotune Choices Stats: 2025-12-04T10:01:23.1566780Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1567031Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1567252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1567570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1568705Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1569912Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1571075Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1572240Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1573363Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1574492Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1574738Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.1574812Z Autotune Choices Stats: 2025-12-04T10:01:23.1576257Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1576700Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1577030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1577662Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1578842Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1580043Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1581243Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1582412Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1583571Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1584728Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1585890Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1587181Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1588435Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1589645Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1589892Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.1590029Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1590100Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1590164Z unimplemented [] 2025-12-04T10:01:23.1590272Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1590455Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1591643Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1591711Z graph_break [] 2025-12-04T10:01:23.1591840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1591912Z Autotune Choices Stats: 2025-12-04T10:01:23.1593314Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, 
"best_triton_pos": 0} 2025-12-04T10:01:23.1593562Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1593780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1594100Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1595329Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1596500Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1597704Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1598837Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1599966Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1601093Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1601338Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.1601410Z Autotune Choices Stats: 2025-12-04T10:01:23.1602851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1603292Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1603705Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1604297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1605467Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1606671Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1607839Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1609020Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1610187Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1611346Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1612549Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1613771Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1614970Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1616131Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1616378Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.1616514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1616596Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1616660Z unimplemented [] 2025-12-04T10:01:23.1616772Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1616957Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1618179Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1618248Z graph_break [] 2025-12-04T10:01:23.1618374Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.1618454Z Autotune Choices Stats: 2025-12-04T10:01:23.1619858Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1620111Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1620375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1620720Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1621928Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1623128Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1624261Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1625397Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1626531Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1627708Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1627954Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.1628032Z Autotune Choices Stats: 2025-12-04T10:01:23.1629533Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1630064Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1630431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1630991Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.1632199Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1633373Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1634539Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1635707Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1636878Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1638032Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1639259Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1640447Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1641642Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1642809Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1643057Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.1643186Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1643260Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1643324Z unimplemented [] 2025-12-04T10:01:23.1643436Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1643629Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1644809Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1644880Z graph_break [] 2025-12-04T10:01:23.1645007Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1645081Z Autotune Choices Stats: 2025-12-04T10:01:23.1646513Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1646793Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1647045Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1647355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1648490Z triton_flex_attention_270 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1649646Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1650768Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1651896Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.1653023Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1654149Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1654391Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.1654503Z Autotune Choices Stats: 2025-12-04T10:01:23.1656169Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1656681Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1657011Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1657624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1658819Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1659999Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1661164Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1662341Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1663510Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1664756Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.1665951Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1667137Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1668341Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1669505Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1669758Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.1669889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1669970Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1670035Z unimplemented [] 2025-12-04T10:01:23.1670145Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1670335Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1671529Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1671600Z graph_break [] 2025-12-04T10:01:23.1671732Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1671862Z Autotune Choices Stats: 2025-12-04T10:01:23.1673297Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1673584Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1673805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1674121Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1675310Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1676438Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1677562Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1678690Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1679822Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1680952Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1681268Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.1681377Z Autotune Choices Stats: 
2025-12-04T10:01:23.1682821Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1683263Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1683632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1684198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1685370Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.1686539Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1687707Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1688870Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1690071Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1691318Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1692521Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1693682Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1694854Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1696029Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1696278Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.1696411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1696495Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1696560Z unimplemented [] 2025-12-04T10:01:23.1696664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1696857Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1698043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1698153Z graph_break [] 2025-12-04T10:01:23.1698313Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1698420Z Autotune Choices Stats: 2025-12-04T10:01:23.1699815Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1700065Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1700287Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1700633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1701780Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1702911Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1704040Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1705167Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1706287Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.1707530Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1707808Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.1707888Z Autotune Choices Stats: 2025-12-04T10:01:23.1709382Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1709828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1710160Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1710723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1711897Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1713075Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1714248Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1715454Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1716658Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1717890Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1719063Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.1720224Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1721391Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1722557Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1722803Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.1722932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1723011Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1723075Z unimplemented [] 
2025-12-04T10:01:23.1723178Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1723367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1724641Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1724746Z graph_break [] 2025-12-04T10:01:23.1724878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1724950Z Autotune Choices Stats: 2025-12-04T10:01:23.1726384Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1726642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1726867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1727179Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1728326Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1729449Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1730592Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1731721Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1732881Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1734066Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1734317Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.1734393Z Autotune Choices Stats: 2025-12-04T10:01:23.1735864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1736311Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1736645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1737210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1738386Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1739557Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1740725Z 
triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1741961Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1743166Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1744358Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1745533Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1746692Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1747899Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1749061Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1749316Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.1749493Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1749572Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1749636Z unimplemented [] 2025-12-04T10:01:23.1749773Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1749973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1751194Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1751267Z graph_break [] 2025-12-04T10:01:23.1751395Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1751469Z Autotune Choices Stats: 2025-12-04T10:01:23.1752916Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1753179Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1753412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1753724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1754865Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1756116Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1757252Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1758448Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1759621Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1760804Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1761097Z SingleProcess AUTOTUNE 
benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.1761177Z Autotune Choices Stats: 2025-12-04T10:01:23.1762629Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1763075Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1763412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1763975Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1765154Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1766331Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1767608Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1768804Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1769995Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1771169Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1772335Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1773492Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1774659Z 
triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1775857Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1776141Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.1776305Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1776381Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1776445Z unimplemented [] 2025-12-04T10:01:23.1776548Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1776741Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1777935Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1778040Z graph_break [] 2025-12-04T10:01:23.1778173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1778240Z Autotune Choices Stats: 2025-12-04T10:01:23.1779646Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1779892Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1780119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1780432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1781566Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:23.1782685Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1783816Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1785014Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1786189Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1787396Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1787646Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.1787721Z Autotune Choices Stats: 2025-12-04T10:01:23.1789158Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1789602Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1789930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1] 2025-12-04T10:01:23.1790492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1791667Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1792867Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1794085Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1795325Z triton_flex_attention_backward_370 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1796702Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1797988Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1799148Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1800306Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1801472Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1802696Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1802980Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.1803113Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.1803191Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1803256Z unimplemented [] 2025-12-04T10:01:23.1803362Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1803553Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1804768Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1804840Z graph_break [] 2025-12-04T10:01:23.1804967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1805036Z Autotune Choices Stats: 2025-12-04T10:01:23.1806448Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1806697Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1806925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1807237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1808384Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1809502Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1810700Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1811857Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1813009Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1814135Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1814382Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.1814457Z Autotune Choices Stats: 2025-12-04T10:01:23.1815908Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1816359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1816699Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1817261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1818435Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1819699Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1820902Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1822099Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1823263Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1824432Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1825596Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1826752Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1828021Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1829213Z triton_flex_attention_backward_398 0.0153 ms 80.5% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1829471Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.1829602Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1829677Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1829774Z unimplemented [] 2025-12-04T10:01:23.1829877Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1830070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1831256Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1831333Z graph_break [] 2025-12-04T10:01:23.1831466Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1831535Z Autotune Choices Stats: 2025-12-04T10:01:23.1832938Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1833179Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1833409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1833722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1834868Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1836017Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1837197Z triton_flex_attention_399 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1838347Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1839479Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1840602Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1840846Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.1840918Z Autotune Choices Stats: 2025-12-04T10:01:23.1842351Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.1842798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1843127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1843682Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1844925Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1846121Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1847322Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1848494Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1849658Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1850822Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1851993Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1853212Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1854443Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1855825Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1856095Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.1856232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1856309Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1856374Z unimplemented [] 2025-12-04T10:01:23.1856478Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1856674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1857869Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1857938Z graph_break [] 2025-12-04T10:01:23.1858069Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1858138Z Autotune Choices Stats: 2025-12-04T10:01:23.1859559Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1859817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1860048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1860360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1861557Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1862978Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1864151Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1865316Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1866445Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1867618Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1867867Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.1867940Z Autotune Choices Stats: 2025-12-04T10:01:23.1869397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.1869842Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1870187Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1870812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1876977Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1878246Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1879424Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1880587Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1881741Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1882897Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1884063Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1885288Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1886470Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1887671Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1887939Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds 
precompiling for 13 choices 2025-12-04T10:01:23.1888083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1888166Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1888232Z unimplemented [] 2025-12-04T10:01:23.1888340Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1888540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1889728Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1889799Z graph_break [] 2025-12-04T10:01:23.1889934Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1890004Z Autotune Choices Stats: 2025-12-04T10:01:23.1891423Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1891673Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.1891899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1892256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1893441Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1894583Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1895744Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1896854Z triton_flex_attention_439 0.0113 ms 72.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1897988Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1899106Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1899358Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.1899427Z Autotune Choices Stats: 2025-12-04T10:01:23.1900867Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1901372Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1901712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1902308Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1903567Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1904731Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1905890Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1907041Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1908254Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1909412Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1910628Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1911818Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1913001Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1914154Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1914414Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.1914551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1914630Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1914695Z unimplemented [] 2025-12-04T10:01:23.1914800Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1914995Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1916182Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1916251Z graph_break [] 2025-12-04T10:01:23.1916381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1916456Z Autotune Choices Stats: 2025-12-04T10:01:23.1917849Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1918132Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1918404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1918765Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1919899Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1921041Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1922158Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1923267Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1924386Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1925509Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1925755Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.1925822Z Autotune Choices Stats: 2025-12-04T10:01:23.1927296Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.1927795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1928134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1928687Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1929888Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1931046Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1932208Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1933360Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1934514Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1935700Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1936912Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1938099Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1939251Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1940407Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1940663Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.1940795Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1940866Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1940947Z unimplemented [] 2025-12-04T10:01:23.1941052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1941249Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1942434Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1942504Z graph_break [] 2025-12-04T10:01:23.1942633Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1942701Z Autotune Choices Stats: 2025-12-04T10:01:23.1944136Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.1944448Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1944676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1944992Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1946160Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1947316Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1950508Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1951640Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1952775Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1953898Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1954153Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.1954230Z Autotune Choices Stats: 2025-12-04T10:01:23.1955956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.1956469Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1956805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1957413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1958576Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1959821Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1960975Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1962140Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1963290Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1964470Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1965658Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1966834Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1967986Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1969186Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.1969440Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.1969581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1969663Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1969731Z unimplemented [] 2025-12-04T10:01:23.1969843Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1970037Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1971227Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1971300Z graph_break [] 2025-12-04T10:01:23.1971433Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.1971509Z Autotune Choices Stats: 2025-12-04T10:01:23.1972940Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 
2025-12-04T10:01:23.1973228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1973457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1973770Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1974955Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1976076Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1977220Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.1978328Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1979447Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.1980599Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1980884Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.1980959Z Autotune Choices Stats: 2025-12-04T10:01:23.1982392Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.1982889Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1983227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.1983781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.1984995Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1986154Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1987358Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1988512Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1989714Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.1990893Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1992075Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.1993220Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.1994405Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.1995553Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.1995802Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.1995938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.1996014Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.1996078Z unimplemented [] 2025-12-04T10:01:23.1996186Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.1996373Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.1997597Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.1997704Z graph_break [] 2025-12-04T10:01:23.1997846Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.1997919Z Autotune Choices Stats: 2025-12-04T10:01:23.1999304Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.1999555Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.1999807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2000118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2001249Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2002399Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2003515Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2004628Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2005739Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2006887Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2007164Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.2007237Z Autotune Choices Stats: 2025-12-04T10:01:23.2008689Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.2009133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2009461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2010051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.2011212Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2012370Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2013520Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2014714Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2015897Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2017093Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2018247Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2019428Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2020576Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2021731Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2021975Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.2022149Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.2022249Z Traceback (most recent call last): 2025-12-04T10:01:23.2022560Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.2022633Z self.assertTrue( 2025-12-04T10:01:23.2022868Z 
File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.2022954Z raise self.failureException(msg) 2025-12-04T10:01:23.2023238Z AssertionError: False is not true : Log file /tmp/tmpr95ztumr/flex_attention_configs.json was not created 2025-12-04T10:01:23.2023244Z 2025-12-04T10:01:23.2023379Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.2023634Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.2023644Z 2025-12-04T10:01:23.2023810Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.2023945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2024023Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2024086Z unimplemented [] 2025-12-04T10:01:23.2024193Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2025455Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.2025655Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2025723Z graph_break [] 2025-12-04T10:01:23.2025855Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2026904Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. 
It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.2026990Z current_size = base.storage().size() 2025-12-04T10:01:23.2027062Z Autotune Choices Stats: 2025-12-04T10:01:23.2028503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.2028756Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2028984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2029299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2030426Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.2031572Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2032720Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2033853Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2034968Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2036127Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2036379Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.2036448Z Autotune Choices Stats: 2025-12-04T10:01:23.2037879Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.2038318Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2038652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1] 2025-12-04T10:01:23.2039210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2040408Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2041634Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2042829Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2043988Z triton_flex_attention_backward_9 0.0143 ms 92.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2045174Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2046325Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2047467Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2048650Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2049851Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2051030Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2051290Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.2051422Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.2051491Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2051559Z unimplemented [] 2025-12-04T10:01:23.2051661Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2051890Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2053082Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2053150Z graph_break [] 2025-12-04T10:01:23.2053278Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2053345Z Autotune Choices Stats: 2025-12-04T10:01:23.2054746Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2054994Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2055360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2055685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2056880Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2058037Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2059209Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2060312Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2061478Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2062594Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2062839Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.2062907Z Autotune Choices Stats: 2025-12-04T10:01:23.2064337Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2064775Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2065143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2065724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2066889Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2068113Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2069262Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2070554Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2072281Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2073443Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2074646Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2075837Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2077014Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2078157Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2078411Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.2078589Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2078659Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2078732Z unimplemented [] 2025-12-04T10:01:23.2078837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2079031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2080214Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2080277Z graph_break [] 2025-12-04T10:01:23.2080413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2080479Z Autotune Choices Stats: 2025-12-04T10:01:23.2081886Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2082135Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2082359Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2082723Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2083854Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2084994Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2086149Z triton_flex_attention_40 0.0102 ms 80.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2087263Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2088415Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2089532Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.2089786Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.2089855Z Autotune Choices Stats: 2025-12-04T10:01:23.2091277Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2091750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2092117Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2092668Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2093861Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2095011Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2096220Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2097382Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2098523Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2099670Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2100840Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2102019Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.2103196Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2104342Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2104633Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.2104767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2104836Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2104907Z unimplemented [] 2025-12-04T10:01:23.2105012Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2105237Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2106649Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2106725Z graph_break [] 2025-12-04T10:01:23.2106884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2106959Z Autotune Choices Stats: 2025-12-04T10:01:23.2108407Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2108695Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2108917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2109264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2110390Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2111536Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2112659Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2113831Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2114941Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2116057Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2116313Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.2116380Z Autotune Choices Stats: 2025-12-04T10:01:23.2117831Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2118303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2118633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2119183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2120381Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2121528Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2122715Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.2123870Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2125009Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2126190Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2127360Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2128525Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2129664Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2130831Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2131084Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.2131213Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2131287Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2131351Z unimplemented [] 2025-12-04T10:01:23.2131455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2131657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2132838Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2132910Z graph_break [] 2025-12-04T10:01:23.2133039Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2133106Z Autotune Choices Stats: 2025-12-04T10:01:23.2134529Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2134805Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2135027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2135332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2136501Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2137603Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2138744Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2139852Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2140955Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2142063Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2142306Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.2142412Z Autotune Choices Stats: 2025-12-04T10:01:23.2143835Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2144312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2144690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2145254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2146420Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2147656Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2148810Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2149967Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2151145Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2152338Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2153517Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2154670Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2156007Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2157149Z 
triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2157399Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.2157530Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2157605Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2157670Z unimplemented [] 2025-12-04T10:01:23.2157774Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2157964Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2159163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2159230Z graph_break [] 2025-12-04T10:01:23.2159356Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2159496Z Autotune Choices Stats: 2025-12-04T10:01:23.2160895Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2161190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2161418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2161777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2162905Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2164014Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.2165179Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2166297Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2167411Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2168559Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2168836Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.2168908Z Autotune Choices Stats: 2025-12-04T10:01:23.2170329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2170806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2171135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2171688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2172894Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2174060Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2175216Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2176380Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2177603Z triton_flex_attention_backward_106 0.0123 
ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2178801Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2179986Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2181131Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2182317Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2183469Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2183725Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.2183855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2183935Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2184006Z unimplemented [] 2025-12-04T10:01:23.2184107Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2184300Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2185542Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2185641Z graph_break [] 2025-12-04T10:01:23.2185768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2185840Z Autotune Choices Stats: 2025-12-04T10:01:23.2187332Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2187619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2187849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2188165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2189310Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2190458Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2191593Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2192714Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2193876Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2195029Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2195273Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.2195345Z Autotune Choices Stats: 2025-12-04T10:01:23.2196811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2197262Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2197590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2198188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2199364Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2200535Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2201690Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2202883Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2204065Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2205302Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2206684Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2208016Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2209169Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2210326Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2210595Z SingleProcess AUTOTUNE benchmarking takes 
0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.2210724Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2210797Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2210858Z unimplemented [] 2025-12-04T10:01:23.2210958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2211202Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2212410Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2212510Z graph_break [] 2025-12-04T10:01:23.2212636Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2212704Z Autotune Choices Stats: 2025-12-04T10:01:23.2214145Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.2214391Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2214615Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2214961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2216094Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2217213Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2218341Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2219457Z triton_flex_attention_136 0.0113 ms 
81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2220610Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2221764Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2222012Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.2222114Z Autotune Choices Stats: 2025-12-04T10:01:23.2223556Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2224032Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2224362Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2224922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2226303Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2227591Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2228783Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2229978Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2231164Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2232321Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2233522Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2234675Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2235840Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2236994Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2237286Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.2237418Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2237526Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2237604Z unimplemented [] 2025-12-04T10:01:23.2237708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2237900Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2239099Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2239167Z graph_break [] 2025-12-04T10:01:23.2239301Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2239408Z Autotune Choices Stats: 2025-12-04T10:01:23.2240811Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.2241106Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2241333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2241653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2242789Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2243921Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2245063Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2246246Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2247408Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2248565Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2248813Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.2248881Z Autotune Choices Stats: 2025-12-04T10:01:23.2250325Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2250803Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2251141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2251699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2252878Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2254033Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2255396Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2256626Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2257831Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2258993Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2260191Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2261350Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2262504Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2263708Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2264006Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.2264144Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2264220Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2264283Z unimplemented [] 2025-12-04T10:01:23.2264389Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2264583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2265817Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2265885Z graph_break [] 2025-12-04T10:01:23.2266024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2266095Z Autotune Choices Stats: 2025-12-04T10:01:23.2267540Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2267832Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2268054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2268369Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2269503Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2270615Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2271773Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2272925Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2274098Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2275218Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2275507Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.2275576Z Autotune Choices Stats: 2025-12-04T10:01:23.2277014Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2277465Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2277802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2278361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2279530Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2280722Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2281917Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2283098Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2284252Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2285437Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2286595Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2287752Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2288946Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2290122Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2290375Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.2290508Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2290576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2290645Z unimplemented [] 2025-12-04T10:01:23.2290750Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2290978Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2292164Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2292263Z graph_break [] 2025-12-04T10:01:23.2292391Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2292457Z Autotune Choices Stats: 2025-12-04T10:01:23.2293859Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 
2025-12-04T10:01:23.2294103Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2294324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2294638Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2295786Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2296936Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2298060Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2299212Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2300368Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2301485Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2301771Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.2301839Z Autotune Choices Stats: 2025-12-04T10:01:23.2303259Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2303698Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2304035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2304590Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2305793Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2306994Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2308219Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2309398Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2310661Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.2311826Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2312986Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2314145Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2315543Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2316731Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2317019Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.2317154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2317224Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2317295Z unimplemented [] 2025-12-04T10:01:23.2317399Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2317589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2318770Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2318871Z graph_break [] 2025-12-04T10:01:23.2319003Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.2319071Z Autotune Choices Stats: 2025-12-04T10:01:23.2320479Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2320733Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2320962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2321277Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2322407Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2323559Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2324714Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2325864Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2326980Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2328127Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2328376Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.2328446Z Autotune Choices Stats: 2025-12-04T10:01:23.2329892Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2330328Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2330661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2331250Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.2332455Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2333647Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2334801Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2335956Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2337153Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2338310Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2339458Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2340661Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2342005Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2343412Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2343686Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.2343815Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2343883Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2343986Z unimplemented [] 2025-12-04T10:01:23.2344086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2344273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2345460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2345523Z graph_break [] 2025-12-04T10:01:23.2345655Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2345722Z Autotune Choices Stats: 2025-12-04T10:01:23.2347119Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.2347428Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2347652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2347964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2349148Z triton_flex_attention_232 0.0092 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2350304Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2351462Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2352577Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.2353730Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2354849Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2355097Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.2355165Z Autotune Choices Stats: 2025-12-04T10:01:23.2356776Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2357220Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2357618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2358216Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2359389Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2360587Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2361753Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2362958Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2364109Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2365269Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.2366452Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2367635Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2368825Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2369976Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2370286Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.2370416Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2370489Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2370561Z unimplemented [] 2025-12-04T10:01:23.2370661Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2370845Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2372028Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2372093Z graph_break [] 2025-12-04T10:01:23.2372227Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2372299Z Autotune Choices Stats: 2025-12-04T10:01:23.2373715Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2373960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2374186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2374539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2375711Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2376853Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2377978Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2379127Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2380242Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2381358Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2381610Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.2381678Z Autotune Choices 
Stats: 2025-12-04T10:01:23.2383145Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2383616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2383953Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2384507Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2385734Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:23.2386891Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2388141Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2389303Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2390457Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2391647Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2392795Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2393992Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2395181Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2396333Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2396617Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.2396746Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2396813Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2396880Z unimplemented [] 2025-12-04T10:01:23.2396978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2397158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2398350Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2398413Z graph_break [] 2025-12-04T10:01:23.2398545Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2398612Z Autotune Choices Stats: 2025-12-04T10:01:23.2400064Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2400348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2400572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2400880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2402009Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2403177Z triton_flex_attention_271 0.0091 ms 89.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2404297Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2405457Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2406583Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.2407704Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2407952Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.2408018Z Autotune Choices Stats: 2025-12-04T10:01:23.2409495Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2409970Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2410303Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2410892Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2412061Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2413250Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2414521Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2415689Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2416843Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2418030Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2419213Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.2420402Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2421555Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2422764Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2423015Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.2423154Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2423226Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2423296Z 
unimplemented [] 2025-12-04T10:01:23.2423398Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2423584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2424769Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2424832Z graph_break [] 2025-12-04T10:01:23.2424967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2425035Z Autotune Choices Stats: 2025-12-04T10:01:23.2426472Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2426747Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2426971Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2427321Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2428483Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2429605Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2430756Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2431874Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2432995Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2434111Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2434395Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.2434514Z Autotune Choices Stats: 2025-12-04T10:01:23.2435954Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2436395Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2436765Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2437324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2438497Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2439680Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.2440839Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2441996Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2443191Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2444378Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2445608Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2446766Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2447954Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2449112Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2449365Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.2449497Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2449568Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2449641Z unimplemented [] 2025-12-04T10:01:23.2449741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2449925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2451116Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2451214Z graph_break [] 2025-12-04T10:01:23.2451352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2451460Z Autotune Choices Stats: 2025-12-04T10:01:23.2452849Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2453094Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2453342Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2453670Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2454796Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2456247Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2457391Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2458514Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2459638Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2460834Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2461139Z 
SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.2461207Z Autotune Choices Stats: 2025-12-04T10:01:23.2462701Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2463145Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2463480Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2464081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2465246Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2466400Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2467641Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2468857Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2470045Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2471230Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2472380Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2473564Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.2474722Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2475878Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2476133Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.2476268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2476337Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2476405Z unimplemented [] 2025-12-04T10:01:23.2476508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2476696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2477922Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2478016Z graph_break [] 2025-12-04T10:01:23.2478151Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2478218Z Autotune Choices Stats: 2025-12-04T10:01:23.2479668Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2479927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2480154Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2480472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2481604Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2482755Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2483876Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2484988Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2486141Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2487292Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2487541Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.2487607Z Autotune Choices Stats: 2025-12-04T10:01:23.2489087Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2489531Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2489901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2490458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2491627Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2492782Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2493936Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2495124Z 
triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2496315Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2497503Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2498651Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2499853Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2501002Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2502159Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2502408Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.2502542Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.2502611Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2502712Z unimplemented [] 2025-12-04T10:01:23.2502818Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2503037Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2504224Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2504287Z graph_break [] 2025-12-04T10:01:23.2504418Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2504485Z Autotune Choices Stats: 2025-12-04T10:01:23.2505916Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2506165Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2506441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2506763Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2507945Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2509067Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2510191Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2511335Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2512484Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2513632Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2513889Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.2513957Z Autotune Choices Stats: 2025-12-04T10:01:23.2515397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2515868Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2516205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2516760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2517928Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2519091Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2520282Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2521462Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2522649Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2523806Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2524985Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2526152Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2527302Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2528501Z 
triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2528780Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.2528910Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2528979Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2529046Z unimplemented [] 2025-12-04T10:01:23.2529146Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2529330Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2530570Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2530635Z graph_break [] 2025-12-04T10:01:23.2530766Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2530834Z Autotune Choices Stats: 2025-12-04T10:01:23.2532226Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2532511Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2532731Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2533049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2534175Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2535292Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.2536421Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2537570Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2538719Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2539868Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2540116Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.2540184Z Autotune Choices Stats: 2025-12-04T10:01:23.2541651Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2542086Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2542417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2542976Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2544143Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2545327Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2546523Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2547752Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2548905Z 
triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2550086Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2551233Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2552395Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2553546Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2554734Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2555014Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.2555151Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2555544Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2555627Z unimplemented [] 2025-12-04T10:01:23.2555734Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2555924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2557187Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2557259Z graph_break [] 2025-12-04T10:01:23.2557401Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2557469Z Autotune Choices Stats: 2025-12-04T10:01:23.2558917Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2559167Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2559390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2559706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2560842Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2561963Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2563127Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2564310Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2565465Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2566584Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2566880Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.2566947Z Autotune Choices Stats: 2025-12-04T10:01:23.2568386Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2568826Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2569163Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2569719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2570925Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2572341Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2573954Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2575154Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2576317Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2577504Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2578648Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2579804Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2580988Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2582225Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2582483Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.2582625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2582729Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2582801Z unimplemented [] 2025-12-04T10:01:23.2582905Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2583093Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2584286Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2584380Z graph_break [] 2025-12-04T10:01:23.2584522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2584595Z Autotune Choices Stats: 2025-12-04T10:01:23.2585998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2586249Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2586471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2586794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2587992Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2589164Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2590312Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2591483Z 
triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2592604Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2593898Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2594154Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.2594223Z Autotune Choices Stats: 2025-12-04T10:01:23.2595669Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.2596115Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2596449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2597004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2598236Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2599416Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2600597Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2601749Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2602939Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2604094Z triton_flex_attention_backward_409 0.0133 ms 
91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2605256Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2606448Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2607626Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2608805Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2609059Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.2609198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2609269Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2609343Z unimplemented [] 2025-12-04T10:01:23.2609457Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2609645Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2610869Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2610935Z graph_break [] 2025-12-04T10:01:23.2611071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2611142Z Autotune Choices Stats: 2025-12-04T10:01:23.2612532Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2612789Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2613012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2613333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2614496Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2615647Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2616790Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2617917Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2619039Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2620191Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2620440Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.2620521Z Autotune Choices Stats: 2025-12-04T10:01:23.2621961Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2622403Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2622744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2623340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2624545Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2625730Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2626894Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2628166Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2629326Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2630498Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2631641Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2632830Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2634014Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2635199Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2635446Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.2635622Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2635702Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2635771Z unimplemented [] 2025-12-04T10:01:23.2635879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2636070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2637261Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2637324Z graph_break [] 2025-12-04T10:01:23.2637461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2637533Z Autotune Choices Stats: 2025-12-04T10:01:23.2638928Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2639181Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2639400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2639722Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2640890Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2642031Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2643177Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2644294Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2645454Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2646567Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2646819Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.2646888Z Autotune Choices Stats: 2025-12-04T10:01:23.2648321Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2648788Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2649156Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2649714Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2650918Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2652078Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2653266Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2654411Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2655720Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2656894Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2658123Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2659319Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2660530Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2661685Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2661983Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.2662125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2662196Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2662258Z unimplemented [] 2025-12-04T10:01:23.2662365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2662554Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2663759Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2663825Z graph_break [] 2025-12-04T10:01:23.2663961Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2664031Z Autotune Choices Stats: 2025-12-04T10:01:23.2665425Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 
2025-12-04T10:01:23.2665684Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2665943Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2666299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2667481Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2668638Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2669760Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2670911Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2672028Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2673150Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2673399Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.2673467Z Autotune Choices Stats: 2025-12-04T10:01:23.2674953Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2675431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2675764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2676318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2677519Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2678679Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2679870Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2681021Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2682179Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.2683362Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2684552Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2685745Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2686888Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2688080Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2688327Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.2688463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2688534Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2688601Z unimplemented [] 2025-12-04T10:01:23.2688713Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2688901Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2690097Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2690162Z graph_break [] 2025-12-04T10:01:23.2690307Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.2690378Z Autotune Choices Stats: 2025-12-04T10:01:23.2691826Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2692129Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2692351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2692676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2693851Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2694976Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2696135Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2697255Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2698370Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2699490Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2699741Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.2699808Z Autotune Choices Stats: 2025-12-04T10:01:23.2701284Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.2701760Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2702124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2702692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.2703862Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2705062Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2706233Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2707437Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2708629Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2709780Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2710991Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2712151Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2713299Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2714490Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2714737Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.2714869Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2714940Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2715001Z unimplemented [] 2025-12-04T10:01:23.2715111Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2715312Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2716501Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2716564Z graph_break [] 2025-12-04T10:01:23.2716692Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2716763Z Autotune Choices Stats: 2025-12-04T10:01:23.2718235Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.2718520Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2718741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2719059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2720218Z triton_flex_attention_498 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2721339Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2722481Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2723600Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.2724733Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2725898Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2726182Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.2726249Z Autotune Choices Stats: 2025-12-04T10:01:23.2727680Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.2728159Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2728488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2729048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2730251Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2731411Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2732570Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2733720Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2734912Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2736091Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.2737276Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2738429Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2739619Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2740771Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2741034Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.2741176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2741247Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2741310Z unimplemented [] 2025-12-04T10:01:23.2741418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2741603Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2742831Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2742929Z graph_break [] 2025-12-04T10:01:23.2743061Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2743138Z Autotune Choices Stats: 2025-12-04T10:01:23.2744531Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.2744816Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2745039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2745356Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2746491Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2747678Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2748794Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2749911Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2751070Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2752191Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2752473Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.2752541Z Autotune Choices Stats: 
2025-12-04T10:01:23.2754011Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.2754465Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2754795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2755570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2756771Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.2757969Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2759147Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2760375Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2761584Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2762789Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2763956Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2765185Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2766568Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2767798Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2768048Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.2768187Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2768259Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2768321Z unimplemented [] 2025-12-04T10:01:23.2768431Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2768654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2769844Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2769938Z graph_break [] 2025-12-04T10:01:23.2770068Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2770141Z Autotune Choices Stats: 2025-12-04T10:01:23.2771598Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.2771853Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2772076Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2772425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2773562Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2774685Z triton_flex_attention_537 0.0113 ms 90.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2775815Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2776950Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2778105Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.2779259Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2779511Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.2779609Z Autotune Choices Stats: 2025-12-04T10:01:23.2781045Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.2781527Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2781856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2782418Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2783589Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2784749Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2785954Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2787140Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2788407Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2789566Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2790762Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.2791927Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2793087Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2794388Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2794638Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.2794862Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.2794977Z Traceback (most recent call last): 2025-12-04T10:01:23.2795277Z File 
"/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.2795358Z self.assertTrue( 2025-12-04T10:01:23.2795557Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.2795639Z raise self.failureException(msg) 2025-12-04T10:01:23.2795885Z AssertionError: False is not true : Log file /tmp/tmpi2u13ooi/flex_attention_configs.json was not created 2025-12-04T10:01:23.2795890Z 2025-12-04T10:01:23.2796028Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.2796299Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.2796303Z 2025-12-04T10:01:23.2796470Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.2796643Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2796724Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2796790Z unimplemented [] 2025-12-04T10:01:23.2796900Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2798100Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.2798329Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2798393Z graph_break [] 2025-12-04T10:01:23.2798525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2799526Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.2799609Z current_size = base.storage().size() 2025-12-04T10:01:23.2799678Z Autotune Choices Stats: 2025-12-04T10:01:23.2801095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.2801347Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2801568Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2801882Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2803061Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2804202Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2805360Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2806492Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2807643Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2808765Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2809016Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.2809097Z Autotune Choices Stats: 2025-12-04T10:01:23.2810547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.2810987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2811357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2811952Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2813125Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2814324Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2815572Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2816982Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2818139Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2819302Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2820505Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2821710Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2822910Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2824061Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2824345Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 
2025-12-04T10:01:23.2824480Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2824558Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2824621Z unimplemented [] 2025-12-04T10:01:23.2824727Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2824923Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2826106Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2826176Z graph_break [] 2025-12-04T10:01:23.2826304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2826372Z Autotune Choices Stats: 2025-12-04T10:01:23.2827873Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2828123Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2828352Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2828713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2829878Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2831037Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2832160Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2833288Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2834430Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2835563Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2835820Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.2835894Z Autotune Choices Stats: 2025-12-04T10:01:23.2837333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2837812Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2838176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2838739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2839936Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2841100Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2842295Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2843455Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2844615Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2845778Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2846979Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2848165Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2849353Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:23.2850511Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2850839Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.2850970Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2851045Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2851107Z unimplemented [] 2025-12-04T10:01:23.2851209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2851403Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2852588Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2852658Z graph_break [] 2025-12-04T10:01:23.2852786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2852853Z Autotune Choices Stats: 2025-12-04T10:01:23.2854265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2854566Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2854827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2855139Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2856419Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2857604Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2858736Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2859898Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2861023Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2862146Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2862390Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.2862463Z Autotune Choices Stats: 2025-12-04T10:01:23.2863955Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2864438Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2864770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2865361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2866539Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2867745Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2868939Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2877149Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.2878551Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2879983Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2881347Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2882707Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2884037Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2885393Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2885691Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.2885839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2885925Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2885993Z unimplemented [] 2025-12-04T10:01:23.2886106Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2886322Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2887723Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2887801Z graph_break [] 2025-12-04T10:01:23.2887944Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2888019Z Autotune Choices Stats: 2025-12-04T10:01:23.2889675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2890002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2890260Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2890620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2891960Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2893241Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2894556Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2895832Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2897097Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2898379Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2898698Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.2898769Z Autotune Choices Stats: 2025-12-04T10:01:23.2900446Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2900978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2901380Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2902027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2903368Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2904727Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2906052Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2907459Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2908810Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2910168Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2911545Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2912871Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2914226Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2915540Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2915832Z SingleProcess AUTOTUNE 
benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.2915974Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2916056Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2916123Z unimplemented [] 2025-12-04T10:01:23.2916230Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2916441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2917835Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2917906Z graph_break [] 2025-12-04T10:01:23.2918080Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2918150Z Autotune Choices Stats: 2025-12-04T10:01:23.2919776Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2920066Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2920319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2920708Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2922009Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2923313Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2924591Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2925875Z triton_flex_attention_79 
0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2927155Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2928463Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2928778Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.2928850Z Autotune Choices Stats: 2025-12-04T10:01:23.2930488Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2931043Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2931418Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2932055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2933435Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2934762Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2936086Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2937411Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2938758Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2940111Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2941455Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2942790Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2944136Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.2945448Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2945742Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.2945883Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2945955Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2946030Z unimplemented [] 2025-12-04T10:01:23.2946141Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2946354Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2947841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2947965Z graph_break [] 2025-12-04T10:01:23.2948101Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2948171Z Autotune Choices Stats: 2025-12-04T10:01:23.2949765Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": 
"triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2950087Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2950336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2950699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2951994Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2953296Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2954577Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2955997Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2957373Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2958716Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2959005Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.2959078Z Autotune Choices Stats: 2025-12-04T10:01:23.2960825Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.2961359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2961734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2962431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2963782Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2965108Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2966447Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2967821Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2969178Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2970529Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2971844Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2973191Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2974516Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2975844Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2976133Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.2976274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.2976349Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.2976419Z unimplemented [] 2025-12-04T10:01:23.2976526Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.2976787Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.2978215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.2978285Z graph_break [] 2025-12-04T10:01:23.2978422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.2978498Z Autotune Choices Stats: 2025-12-04T10:01:23.2980120Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.2980412Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.2980661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2981050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2982341Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2983608Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2984892Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2986174Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.2987563Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.2988871Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2989201Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.2989274Z Autotune Choices Stats: 2025-12-04T10:01:23.2990910Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.2991467Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.2991835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.2992472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.2993815Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.2995133Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.2996502Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.2997865Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.2999220Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3000561Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3001917Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3003245Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3004570Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3005883Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.3006203Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.3006375Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3006448Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3006518Z unimplemented [] 2025-12-04T10:01:23.3006624Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3006835Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3008229Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3008294Z graph_break [] 2025-12-04T10:01:23.3008468Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3008542Z Autotune Choices Stats: 2025-12-04T10:01:23.3010142Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.3010464Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3010713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3011071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3012360Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3013628Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3014912Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3016224Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3017541Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3018841Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3019131Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.3019198Z Autotune Choices Stats: 2025-12-04T10:01:23.3020838Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3021409Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3021778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3022423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3023760Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3025118Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3026450Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3027902Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3029246Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.3030577Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3031924Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3033247Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3034573Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3035933Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3036255Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.3036394Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3036467Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3036538Z unimplemented [] 2025-12-04T10:01:23.3036643Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3036850Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3038279Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3038344Z graph_break [] 2025-12-04T10:01:23.3038482Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.3038554Z Autotune Choices Stats: 2025-12-04T10:01:23.3040146Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.3040473Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3040717Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3041073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3042365Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3043637Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3044945Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3046246Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3047559Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3048837Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3049147Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.3049217Z Autotune Choices Stats: 2025-12-04T10:01:23.3050851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3051369Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3051737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3052379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.3053723Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3055075Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3056620Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3058045Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3059368Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3060741Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3062064Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3063399Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3064764Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3066130Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3066424Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.3066577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3066650Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3066725Z unimplemented [] 2025-12-04T10:01:23.3066870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3067081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3068517Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3068632Z graph_break [] 2025-12-04T10:01:23.3068780Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3068851Z Autotune Choices Stats: 2025-12-04T10:01:23.3070456Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3070748Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3070998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3071355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3072644Z triton_flex_attention_175 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3073963Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3075277Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3076582Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.3077861Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3079143Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3079469Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.3079538Z Autotune Choices Stats: 2025-12-04T10:01:23.3081184Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3081711Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3082083Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3082724Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3084098Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3085468Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3086822Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3088152Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3089525Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3090853Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.3092180Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3093495Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3094847Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3096226Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3096554Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.3096707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3096780Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3096850Z unimplemented [] 2025-12-04T10:01:23.3096958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3097167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3098568Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3098669Z graph_break [] 2025-12-04T10:01:23.3098814Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3098881Z Autotune Choices Stats: 2025-12-04T10:01:23.3100484Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.3100778Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3101022Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3101380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3102668Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3103985Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3105292Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3106596Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3107910Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3109228Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3109520Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.3109593Z Autotune Choices Stats: 
2025-12-04T10:01:23.3111240Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3111762Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3112131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3112812Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3114227Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.3115586Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3116918Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3118296Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3119638Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3120973Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3122293Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3123657Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3125016Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3126380Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3126669Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.3126807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3126911Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3126981Z unimplemented [] 2025-12-04T10:01:23.3127086Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3127294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3128697Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3128760Z graph_break [] 2025-12-04T10:01:23.3128901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3128967Z Autotune Choices Stats: 2025-12-04T10:01:23.3130579Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3130871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3131121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3131476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3132815Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3134131Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3135445Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3136723Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3138043Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.3139322Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3139613Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.3139681Z Autotune Choices Stats: 2025-12-04T10:01:23.3141329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3141847Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3142247Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3142931Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3144265Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3145625Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3146956Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3148353Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3149676Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3150999Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3152377Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.3153731Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3155079Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3156523Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3156872Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.3157012Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3157090Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3157163Z 
unimplemented [] 2025-12-04T10:01:23.3157269Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3157476Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3158872Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3158935Z graph_break [] 2025-12-04T10:01:23.3159077Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3159146Z Autotune Choices Stats: 2025-12-04T10:01:23.3160749Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.3161039Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3161332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3161694Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3163040Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3164383Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3165668Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3166993Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3168270Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3169541Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3169835Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.3169903Z Autotune Choices Stats: 2025-12-04T10:01:23.3171592Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3172152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3172521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3173163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3174537Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3175861Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.3177218Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3178544Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3179871Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3181229Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3182588Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3183954Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3185269Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3186598Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3186919Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.3187057Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3187129Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3187197Z unimplemented [] 2025-12-04T10:01:23.3187352Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3187558Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3188952Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3189015Z graph_break [] 2025-12-04T10:01:23.3189154Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3189221Z Autotune Choices Stats: 2025-12-04T10:01:23.3190868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3191190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3191432Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3191785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3193068Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3194388Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3195669Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3196975Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3198255Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3199528Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3199810Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.3199878Z Autotune Choices Stats: 2025-12-04T10:01:23.3201557Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3202107Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3202469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3203160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3204500Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3205874Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3207198Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3208528Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3209856Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3211215Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3212560Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3213922Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.3215250Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3216608Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3216892Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.3217036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3217107Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3217178Z unimplemented [] 2025-12-04T10:01:23.3217287Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3217495Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3218891Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3218953Z graph_break [] 2025-12-04T10:01:23.3219096Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3219166Z Autotune Choices Stats: 2025-12-04T10:01:23.3220809Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3221128Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3221368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3221730Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3223061Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3224344Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3225656Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3226931Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3228242Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3229551Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3229843Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.3229946Z Autotune Choices Stats: 2025-12-04T10:01:23.3231589Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3232114Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3232529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3233167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3234514Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3235869Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3237205Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3238536Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3239900Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3241274Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3242639Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3243971Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3245327Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3246649Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3246937Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.3247079Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.3247150Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3247219Z unimplemented [] 2025-12-04T10:01:23.3247325Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3247540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3248967Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3249037Z graph_break [] 2025-12-04T10:01:23.3249211Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3249279Z Autotune Choices Stats: 2025-12-04T10:01:23.3250870Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3251161Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3251437Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3251796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3253085Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3254393Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3255829Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3257118Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3258397Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3259944Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3260284Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.3260354Z Autotune Choices Stats: 2025-12-04T10:01:23.3262045Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3262568Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3262932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3263626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3264968Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3266287Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3267659Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3269065Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3270426Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3271783Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3273100Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3274455Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3275771Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3277097Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3277380Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.3277518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3277588Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3277654Z unimplemented [] 2025-12-04T10:01:23.3277760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3277965Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3279414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3279511Z graph_break [] 2025-12-04T10:01:23.3279649Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3279718Z Autotune Choices Stats: 2025-12-04T10:01:23.3281344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3281638Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3281874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3282230Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3283552Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3284830Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.3286110Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3287392Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3288710Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3290019Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3290309Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.3290377Z Autotune Choices Stats: 2025-12-04T10:01:23.3292042Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3292562Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3292964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3293598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3294938Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3296264Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3297592Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3298951Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3300313Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3301662Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3302983Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3304338Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3305653Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3306977Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3307312Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.3307457Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3307567Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3307632Z unimplemented [] 2025-12-04T10:01:23.3307780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3307983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3309376Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3309439Z graph_break [] 2025-12-04T10:01:23.3309577Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3309644Z Autotune Choices Stats: 2025-12-04T10:01:23.3311263Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3311551Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3311840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3312196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3313482Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3314764Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3316050Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3317363Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3318679Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3319982Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3320273Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.3320339Z Autotune Choices Stats: 2025-12-04T10:01:23.3321968Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3322518Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3322884Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3323521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3324863Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3326195Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3327557Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3328913Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3330268Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3331591Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3332945Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3334275Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3335612Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3336967Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3337276Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.3337414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3337482Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3337545Z unimplemented [] 2025-12-04T10:01:23.3337654Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3337857Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3339282Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3339353Z graph_break [] 2025-12-04T10:01:23.3339493Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3339561Z Autotune Choices Stats: 2025-12-04T10:01:23.3341152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3341483Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3341721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3342073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3343356Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3344641Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3345946Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3347272Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3348606Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3349934Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3350219Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.3350322Z Autotune Choices Stats: 2025-12-04T10:01:23.3351978Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3352493Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3352856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3353509Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3354843Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3356375Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3357745Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3359122Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3360447Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3361811Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3363136Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3364455Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3365765Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3367120Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3367433Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.3367573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3367646Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3367709Z unimplemented [] 2025-12-04T10:01:23.3367820Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3368025Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3369445Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3369513Z graph_break [] 2025-12-04T10:01:23.3369647Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3369753Z Autotune Choices Stats: 2025-12-04T10:01:23.3371358Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3371652Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3371901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3372258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3373538Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3374822Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3376145Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3377458Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3378768Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3380048Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3380377Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.3380450Z Autotune Choices Stats: 2025-12-04T10:01:23.3382096Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3382618Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3382979Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3383623Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3384994Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3386383Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3387796Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3389114Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3390439Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3391795Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3393121Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3394442Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3395798Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3397160Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3397438Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.3397614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3397687Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3397751Z unimplemented [] 2025-12-04T10:01:23.3397863Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3398072Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3399463Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3399561Z graph_break [] 2025-12-04T10:01:23.3399696Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3399777Z Autotune Choices Stats: 2025-12-04T10:01:23.3401365Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3401657Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3401899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3402255Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3403545Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3404859Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3406170Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3407480Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3408768Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3410097Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3410381Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.3410449Z Autotune Choices Stats: 2025-12-04T10:01:23.3412086Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3412614Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3412972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3413609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3414974Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3416340Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3417702Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3419022Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3420416Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3421738Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3423069Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3424442Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3425802Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3427163Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3427493Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.3427637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3427707Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3427771Z unimplemented [] 2025-12-04T10:01:23.3427880Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3428082Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3429512Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3429577Z graph_break [] 2025-12-04T10:01:23.3429709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3429783Z Autotune Choices Stats: 2025-12-04T10:01:23.3431384Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3431677Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3431915Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3432271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3433590Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3434909Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3436224Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3437499Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3438777Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3440085Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3440375Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.3440445Z Autotune Choices Stats: 2025-12-04T10:01:23.3442095Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.3442620Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3443019Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3443663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3445034Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3446395Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3447719Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3449077Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3450398Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3451719Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3453077Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3454401Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3455875Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3457253Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3457535Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.3457740Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3457811Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3457875Z unimplemented [] 2025-12-04T10:01:23.3457987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3458190Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3459585Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3459651Z graph_break [] 2025-12-04T10:01:23.3459786Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3459862Z Autotune Choices Stats: 2025-12-04T10:01:23.3461456Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3461749Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3461992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3462354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3463687Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3465001Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3466303Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3467680Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3468995Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3470280Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3470568Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.3470635Z Autotune Choices Stats: 2025-12-04T10:01:23.3472266Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3472830Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3473224Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3473861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3475223Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3476558Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3477920Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3479233Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3480570Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3481887Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3483237Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3484594Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3485954Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3487279Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3487595Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.3487736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3487807Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3487869Z unimplemented [] 2025-12-04T10:01:23.3487983Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3488193Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3489582Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3489648Z graph_break [] 2025-12-04T10:01:23.3489781Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3489854Z Autotune Choices Stats: 2025-12-04T10:01:23.3491440Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3491767Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3492010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3492397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3493684Z 
triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3495011Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3496287Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3497602Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3498879Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3500160Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3500438Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.3500512Z Autotune Choices Stats: 2025-12-04T10:01:23.3502183Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3502745Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3503101Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3503744Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3505113Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3506453Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3507866Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3509184Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3510513Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3511869Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.3513229Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3514583Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3515914Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3517279Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3517558Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.3517698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3517770Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3517834Z unimplemented [] 2025-12-04T10:01:23.3517945Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3518153Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3519550Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3519614Z graph_break [] 2025-12-04T10:01:23.3519748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3519821Z Autotune Choices Stats: 2025-12-04T10:01:23.3521449Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3521776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3522014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3522371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3523693Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3524978Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3526290Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3527574Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3528856Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3530143Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3530425Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.3530534Z Autotune Choices Stats: 
2025-12-04T10:01:23.3532167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3532743Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3533142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3533790Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3535130Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.3536495Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3537820Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3539146Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3540500Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3541850Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3543212Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3544541Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3545903Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3547259Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3547550Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.3547691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3547762Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3547828Z unimplemented [] 2025-12-04T10:01:23.3547937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3548138Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3549525Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3549592Z graph_break [] 2025-12-04T10:01:23.3549724Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3549839Z Autotune Choices Stats: 2025-12-04T10:01:23.3551437Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3551763Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3552006Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3552393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3553684Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3554963Z triton_flex_attention_480 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3556396Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3557686Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3558961Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.3560299Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3560625Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.3560705Z Autotune Choices Stats: 2025-12-04T10:01:23.3562350Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.3562934Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3563296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3563937Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3565329Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3566666Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3567998Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3569329Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3570698Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3572049Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3573441Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.3574755Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3576120Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3577442Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3577724Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.3577862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3577937Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3578012Z 
unimplemented [] 2025-12-04T10:01:23.3578125Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3578331Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3579759Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3579863Z graph_break [] 2025-12-04T10:01:23.3579997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3580071Z Autotune Choices Stats: 2025-12-04T10:01:23.3581663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.3581989Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3582232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3582582Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3583877Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3585199Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3586474Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3587793Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3589112Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3590440Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3590717Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.3590794Z Autotune Choices Stats: 2025-12-04T10:01:23.3592474Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.3592996Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3593358Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3594030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3595364Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3596712Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.3598033Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3599395Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3600747Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3602095Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3603415Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3604780Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3606103Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3607436Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3607722Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.3607860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3607938Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3608000Z unimplemented [] 2025-12-04T10:01:23.3608114Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3608352Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3609780Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3609848Z graph_break [] 2025-12-04T10:01:23.3609985Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3610062Z Autotune Choices Stats: 2025-12-04T10:01:23.3611687Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.3611986Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3612226Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3612607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3613901Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3615175Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3616456Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3617737Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3619052Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3620362Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3620675Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.3620749Z Autotune Choices Stats: 2025-12-04T10:01:23.3622383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.3622953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3623317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3623958Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3625293Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3626632Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3628056Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3629415Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3630774Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3632092Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3633445Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3634766Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.3636093Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3637411Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3637724Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.3637860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3637970Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3638037Z unimplemented [] 2025-12-04T10:01:23.3638146Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3638347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3639737Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3639805Z graph_break [] 2025-12-04T10:01:23.3639997Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3640078Z Autotune Choices Stats: 2025-12-04T10:01:23.3641650Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.3641974Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3642216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3642567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3643877Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3645152Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3646440Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3647755Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3649077Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3650399Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3650684Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.3650759Z Autotune Choices Stats: 2025-12-04T10:01:23.3652387Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.3652947Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3653306Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3653953Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3655443Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3656838Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3658157Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3659527Z 
triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3660895Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3662218Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3663584Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3664909Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3666232Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3667647Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3667962Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.3668099Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.3668175Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3668238Z unimplemented [] 2025-12-04T10:01:23.3668342Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3668557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3669985Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3670053Z graph_break [] 2025-12-04T10:01:23.3670190Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3670265Z Autotune Choices Stats: 2025-12-04T10:01:23.3671854Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3672183Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3672424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3672774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3674067Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3675344Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3676659Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3677990Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3679292Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3680578Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3680887Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.3680961Z Autotune Choices Stats: 2025-12-04T10:01:23.3682588Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.3683110Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3683475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3684116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3685462Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3686832Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3688188Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3689554Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3690882Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3692241Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3693576Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3694896Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3696256Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3697603Z 
triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:23.3697882Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices
2025-12-04T10:01:23.3698079Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:23.3698165Z Traceback (most recent call last):
2025-12-04T10:01:23.3698552Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:23.3698627Z self.assertTrue(
2025-12-04T10:01:23.3698862Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:23.3698947Z raise self.failureException(msg)
2025-12-04T10:01:23.3699228Z AssertionError: False is not true : Log file /tmp/tmpaff1vcq5/flex_attention_configs.json was not created
2025-12-04T10:01:23.3699233Z
2025-12-04T10:01:23.3699379Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:23.3699679Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:23.3699718Z
2025-12-04T10:01:23.3699903Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:23.3700045Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:23.3700125Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:23.3700188Z unimplemented []
2025-12-04T10:01:23.3700299Z stats [('calls_captured', 105),
('unique_graphs', 2)] 2025-12-04T10:01:23.3701703Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.3701907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3701975Z graph_break [] 2025-12-04T10:01:23.3702109Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3703280Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.3703371Z current_size = base.storage().size() 2025-12-04T10:01:23.3703440Z Autotune Choices Stats: 2025-12-04T10:01:23.3705077Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.3705402Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3705658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3706016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3707355Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3708659Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3709944Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3711258Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3712531Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3713809Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3714091Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.3714163Z Autotune Choices Stats: 2025-12-04T10:01:23.3715864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.3716423Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3716791Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3717474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3718810Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3720165Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3721480Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3722797Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3724110Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3725476Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3726865Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3728216Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3729534Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3730874Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3731162Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.3731300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3731372Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3731443Z unimplemented [] 2025-12-04T10:01:23.3731551Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3731768Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3733171Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3733240Z graph_break [] 2025-12-04T10:01:23.3733375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3733445Z Autotune Choices Stats: 2025-12-04T10:01:23.3735071Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3735395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3735647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3736002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3737324Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3738604Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3739916Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3741187Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3742459Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3743726Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3744041Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.3744143Z Autotune Choices Stats: 2025-12-04T10:01:23.3745793Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3746313Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3746711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3747400Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3748740Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3750117Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3751447Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3752760Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3754110Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3755586Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3756976Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3758316Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3759685Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3760996Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3761284Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.3761422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3761494Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3761565Z unimplemented [] 2025-12-04T10:01:23.3761672Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3761884Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3763277Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3763353Z graph_break [] 2025-12-04T10:01:23.3763722Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3763846Z Autotune Choices Stats: 2025-12-04T10:01:23.3765441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3765743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3766030Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3766388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3767671Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3768973Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3770258Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3771528Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3772805Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3774116Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3774432Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:23.3774502Z Autotune Choices Stats: 2025-12-04T10:01:23.3776171Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3776692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3777059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3777697Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3779064Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3780385Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3781714Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3783065Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3784430Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3785800Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3787116Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3788487Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3789852Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3791166Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3791452Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.3791586Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3791659Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3791729Z unimplemented [] 2025-12-04T10:01:23.3791836Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3792042Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3793472Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3793569Z graph_break [] 2025-12-04T10:01:23.3793712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3793780Z Autotune Choices Stats: 2025-12-04T10:01:23.3795404Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3795693Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3795936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3796287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3797583Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3798893Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3800171Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3801449Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3802772Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.3804080Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3804365Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.3804433Z Autotune Choices Stats: 2025-12-04T10:01:23.3806110Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3806637Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3807039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3807676Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3809012Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3810348Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3811672Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3813040Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3814404Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3815749Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3817075Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:23.3818436Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3819754Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3821073Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3821364Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.3821503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3821572Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3821643Z unimplemented [] 
2025-12-04T10:01:23.3821783Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3821991Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3823432Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3823497Z graph_break [] 2025-12-04T10:01:23.3823639Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3823709Z Autotune Choices Stats: 2025-12-04T10:01:23.3825342Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3825627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3825874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3826261Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3827601Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3828886Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3830171Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3831484Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3832762Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3834074Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3834390Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.3834460Z Autotune Choices Stats: 2025-12-04T10:01:23.3836112Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3836674Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3837044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3837690Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3839028Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3840354Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3841727Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3843089Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3844439Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3845763Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3847109Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3848439Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3849754Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3851077Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3851402Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.3851573Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3851644Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3851713Z unimplemented [] 2025-12-04T10:01:23.3851821Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3852025Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3853429Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3853492Z graph_break [] 2025-12-04T10:01:23.3853672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3853742Z Autotune Choices Stats: 2025-12-04T10:01:23.3855532Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3855949Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3856241Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3856673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3858105Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3859390Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3860683Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3862027Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3863346Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3864669Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3864963Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.3865030Z Autotune Choices Stats: 2025-12-04T10:01:23.3866687Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3867280Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3867655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3868306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3869655Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3871025Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3872397Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3873750Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3875080Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3876453Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3877782Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3879114Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3880435Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3881845Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3882163Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.3882300Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3882373Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3882441Z unimplemented [] 2025-12-04T10:01:23.3882549Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3882755Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3884173Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3884239Z graph_break [] 2025-12-04T10:01:23.3884376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3884442Z Autotune Choices Stats: 2025-12-04T10:01:23.3886061Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3886387Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3886628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3886987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3888283Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.3889560Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3890877Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3892197Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3893515Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3894797Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3895130Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.3895199Z Autotune Choices Stats: 2025-12-04T10:01:23.3896863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.3897380Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3897751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:23.3898404Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3899748Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3901113Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3902472Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3903831Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3905157Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3906530Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3907890Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3909223Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3910593Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3912017Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3912335Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.3912476Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.3912549Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3912617Z unimplemented [] 2025-12-04T10:01:23.3912757Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3912961Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3914358Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3914455Z graph_break [] 2025-12-04T10:01:23.3914595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3914662Z Autotune Choices Stats: 2025-12-04T10:01:23.3916270Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.3916556Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3916795Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3917158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3918446Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3919763Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3921076Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3922392Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3923671Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3924954Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3925268Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.3925337Z Autotune Choices Stats: 2025-12-04T10:01:23.3926992Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3927511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3927883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3928522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3929897Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3931261Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3932645Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3934153Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3935543Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3936874Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3938198Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3939566Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3940893Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3942286Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3942572Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.3942715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3942788Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3942859Z unimplemented [] 2025-12-04T10:01:23.3942968Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3943172Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3944571Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3944670Z graph_break [] 2025-12-04T10:01:23.3944813Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3944884Z Autotune Choices Stats: 2025-12-04T10:01:23.3946484Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.3946777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3947032Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3947474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3948766Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3950088Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3951400Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3952725Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3954024Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3955498Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3955792Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.3955863Z Autotune Choices Stats: 2025-12-04T10:01:23.3957511Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3958032Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3958398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3959109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3960512Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3961884Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3963210Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3964580Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3965909Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3967227Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3968554Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3969921Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3971288Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3972662Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.3972946Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.3973088Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.3973192Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.3973259Z unimplemented [] 2025-12-04T10:01:23.3973375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.3973584Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.3974983Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.3975047Z graph_break [] 2025-12-04T10:01:23.3975185Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.3975256Z Autotune Choices Stats: 2025-12-04T10:01:23.3976860Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.3977147Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3977388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3977750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3979079Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3980391Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3981712Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.3983002Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.3984318Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3985597Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3985897Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.3985970Z Autotune Choices Stats: 2025-12-04T10:01:23.3987665Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.3988183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.3988588Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.3989262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.3990607Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3991966Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.3993301Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3994663Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.3995992Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.3997318Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.3998683Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4000044Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4001393Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4002720Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4003033Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:23.4003191Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4003264Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4003332Z unimplemented [] 2025-12-04T10:01:23.4003439Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4003648Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4005044Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4005109Z graph_break [] 2025-12-04T10:01:23.4005249Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4005322Z Autotune Choices Stats: 2025-12-04T10:01:23.4006921Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.4010648Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4011038Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4011449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4012762Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4014104Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4015393Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4016722Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4018003Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4019278Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4019573Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.4019648Z Autotune Choices Stats: 2025-12-04T10:01:23.4021345Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4021905Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4022279Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4022922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4024308Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4025644Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4027006Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4028420Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4029748Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4031108Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4032465Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4033816Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4035130Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:23.4036537Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4036824Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.4036978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4037052Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4037118Z unimplemented [] 2025-12-04T10:01:23.4037244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4037454Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4038864Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4038929Z graph_break [] 2025-12-04T10:01:23.4039071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4039141Z Autotune Choices Stats: 2025-12-04T10:01:23.4040774Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4041108Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4041351Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4041706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4043037Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4044318Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4045596Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4046927Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4048203Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4049479Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4049768Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.4049836Z Autotune Choices Stats: 2025-12-04T10:01:23.4051502Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4052062Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4052424Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4053099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4054434Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4055995Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4057330Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4058660Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.4059982Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4061370Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4062746Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4064123Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4065447Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4066820Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4067100Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.4067295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4067367Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4067431Z unimplemented [] 2025-12-04T10:01:23.4067547Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4067763Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4069156Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4069220Z graph_break [] 2025-12-04T10:01:23.4069357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4069436Z Autotune Choices Stats: 2025-12-04T10:01:23.4071078Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.4071420Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4071665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4072029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4073350Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4074634Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4075936Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4077219Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4078498Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4079801Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4080120Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.4080187Z Autotune Choices Stats: 2025-12-04T10:01:23.4081828Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4082405Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.4082775Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4083418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4084786Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4086128Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4087461Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4088785Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4090143Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4091492Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4092851Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4094174Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4095535Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4096865Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.4097150Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.4097297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4097369Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4097434Z unimplemented [] 2025-12-04T10:01:23.4097547Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4097752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4099187Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4099286Z graph_break [] 2025-12-04T10:01:23.4099434Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4099513Z Autotune Choices Stats: 2025-12-04T10:01:23.4101130Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4101438Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4101715Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4102076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4103378Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4104710Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4106005Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4107378Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4108657Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4109991Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4110315Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.4110384Z Autotune Choices Stats: 2025-12-04T10:01:23.4112061Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4112588Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4112951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4113625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4114965Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4116298Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4117645Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4119008Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4120386Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.4121744Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4123075Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4124431Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4125755Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4127083Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4127367Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.4127509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4127581Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4127648Z unimplemented [] 2025-12-04T10:01:23.4127760Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4127967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4129411Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4129508Z graph_break [] 2025-12-04T10:01:23.4129644Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.4129722Z Autotune Choices Stats: 2025-12-04T10:01:23.4131358Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4131656Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4131898Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4132262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4133582Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4134860Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4136154Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4137438Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4138748Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4140062Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4140347Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.4140417Z Autotune Choices Stats: 2025-12-04T10:01:23.4142434Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4143209Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4143586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4144235Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.4145587Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4146925Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4148406Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4149742Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4151151Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4152467Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4153799Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4155424Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4156826Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4158157Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4158446Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.4158702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4158781Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4158901Z unimplemented [] 2025-12-04T10:01:23.4159018Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4159228Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4160626Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4160694Z graph_break [] 2025-12-04T10:01:23.4160832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4160914Z Autotune Choices Stats: 2025-12-04T10:01:23.4162566Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4162869Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4163157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4163526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4164830Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4166109Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4167387Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4168721Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.4170038Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4171353Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4171639Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.4171716Z Autotune Choices Stats: 2025-12-04T10:01:23.4173363Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4173933Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4174296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4174942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4176281Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4177615Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4178984Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4180334Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4181690Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4183033Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.4184390Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4185713Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4187035Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4188475Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4188791Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.4188934Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4189005Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4189068Z unimplemented [] 2025-12-04T10:01:23.4189179Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4189387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4190874Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4190945Z graph_break [] 2025-12-04T10:01:23.4191079Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4191164Z Autotune Choices Stats: 2025-12-04T10:01:23.4192760Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4193112Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4193354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4193717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4195005Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4196290Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4197607Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4198919Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4200185Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4201503Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4201784Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.4201897Z Autotune Choices Stats: 
2025-12-04T10:01:23.4203550Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4204081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4204445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4205105Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4206479Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.4207859Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4209235Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4210611Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4211950Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4213312Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4214644Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4215990Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4217326Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4218695Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4219025Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.4219176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4219250Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4219315Z unimplemented [] 2025-12-04T10:01:23.4219431Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4219638Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4221075Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4221151Z graph_break [] 2025-12-04T10:01:23.4221287Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4221398Z Autotune Choices Stats: 2025-12-04T10:01:23.4223013Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4223314Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4223561Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4223926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4225234Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4226529Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4227912Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4229241Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4230578Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.4231870Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4232188Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.4232275Z Autotune Choices Stats: 2025-12-04T10:01:23.4233926Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4234452Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4234818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4235477Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4236864Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4238247Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4239616Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4240967Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4242342Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4243672Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4245011Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.4246352Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4247723Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4249093Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4249381Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.4249560Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4249646Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4249711Z 
unimplemented [] 2025-12-04T10:01:23.4249825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4250033Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4251426Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4251531Z graph_break [] 2025-12-04T10:01:23.4251668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4251747Z Autotune Choices Stats: 2025-12-04T10:01:23.4253352Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4253643Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4253884Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4254253Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4255711Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4257234Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4258596Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4259930Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4261221Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4262574Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4262857Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.4262935Z Autotune Choices Stats: 2025-12-04T10:01:23.4264588Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4265114Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4265497Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4266153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4267602Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4269004Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.4270383Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4271732Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4273093Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4274433Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4275790Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4277600Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4278997Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4280381Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4280670Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.4280811Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4280891Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4280957Z unimplemented [] 2025-12-04T10:01:23.4281069Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4281308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4282701Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4282769Z graph_break [] 2025-12-04T10:01:23.4282905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4282980Z Autotune Choices Stats: 2025-12-04T10:01:23.4284597Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4284890Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4285133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4285490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4286843Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4288172Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4289500Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4290789Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4292115Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4293405Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4293692Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.4293770Z Autotune Choices Stats: 2025-12-04T10:01:23.4295421Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4295953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4296357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4297045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4298396Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4299777Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4301115Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4302485Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4303821Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4305151Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4306551Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4307964Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.4309333Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4310673Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4310960Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.4311133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4311211Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4311285Z unimplemented [] 2025-12-04T10:01:23.4311403Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4311614Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4313015Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4313089Z graph_break [] 2025-12-04T10:01:23.4313225Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4313302Z Autotune Choices Stats: 2025-12-04T10:01:23.4315018Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4315326Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4315567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4315989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4317302Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4318630Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4319951Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4321241Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4322569Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4323865Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4324151Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.4324227Z Autotune Choices Stats: 2025-12-04T10:01:23.4325891Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4326466Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4326864Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4327520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4328895Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4330247Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4331616Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4332952Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4334280Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4335688Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4337133Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4338506Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4339871Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4341222Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4341571Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.4341713Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.4341793Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4341859Z unimplemented [] 2025-12-04T10:01:23.4341970Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4342177Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4343574Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4343646Z graph_break [] 2025-12-04T10:01:23.4343781Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4343859Z Autotune Choices Stats: 2025-12-04T10:01:23.4345475Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4345807Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4346147Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4346506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4347861Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4349183Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4350476Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4351805Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4353092Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4354390Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4354673Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.4354749Z Autotune Choices Stats: 2025-12-04T10:01:23.4356611Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.4357187Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4357550Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4358247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4359605Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4360950Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4362332Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4363673Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4365009Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4366383Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4367748Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4369127Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4370463Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4371844Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4372139Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.4372283Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4372361Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4372431Z unimplemented [] 2025-12-04T10:01:23.4372541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4372763Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4374594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4374682Z graph_break [] 2025-12-04T10:01:23.4374828Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4374909Z Autotune Choices Stats: 2025-12-04T10:01:23.4376603Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4376934Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4377179Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4377543Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4378912Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4380201Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.4381531Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4382826Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4384122Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4385446Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4385765Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.4385843Z Autotune Choices Stats: 2025-12-04T10:01:23.4387606Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4388138Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4388551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4389220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4390561Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4391934Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4393265Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4394604Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4395978Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4397337Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4398700Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4400028Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4401410Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4402738Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4403028Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.4403171Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4403252Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4403317Z unimplemented [] 2025-12-04T10:01:23.4403429Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4403647Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4405038Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4405110Z graph_break [] 2025-12-04T10:01:23.4405293Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4405368Z Autotune Choices Stats: 2025-12-04T10:01:23.4407007Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4407300Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4407553Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4407943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4409251Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4410563Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4411852Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4413153Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4414436Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4415771Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4416102Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.4416176Z Autotune Choices Stats: 2025-12-04T10:01:23.4417855Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4418385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4418749Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4419397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4420781Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4422113Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4423448Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4424787Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4426155Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4427582Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4428961Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4430291Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4431662Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4432994Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4433284Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.4433421Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4433500Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4433565Z unimplemented [] 2025-12-04T10:01:23.4433674Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4433889Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4435322Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4435423Z graph_break [] 2025-12-04T10:01:23.4435570Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4435640Z Autotune Choices Stats: 2025-12-04T10:01:23.4437257Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4437614Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4437872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4438229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4439536Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4440854Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4444434Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4445787Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4447164Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4448512Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4448811Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.4448894Z Autotune Choices Stats: 2025-12-04T10:01:23.4450553Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4451080Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4451486Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4452129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4453482Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4454907Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4456393Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4457806Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4459174Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4460506Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4461821Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4463193Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4464516Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4465900Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4466201Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.4466351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4466423Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4466494Z unimplemented [] 2025-12-04T10:01:23.4466601Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4466849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4468349Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4468414Z graph_break [] 2025-12-04T10:01:23.4468556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4468626Z Autotune Choices Stats: 2025-12-04T10:01:23.4470225Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4470514Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4470758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4471153Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4472438Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4473719Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4475053Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4476326Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4477636Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4478944Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4479233Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.4479303Z Autotune Choices Stats: 2025-12-04T10:01:23.4480948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.4481501Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4481875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4482519Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4483893Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4485216Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4486589Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4487945Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4489263Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4490581Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4491952Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4493272Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4494630Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4495968Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4496295Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.4496465Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4496537Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4496608Z unimplemented [] 2025-12-04T10:01:23.4496713Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4496917Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4498313Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4498389Z graph_break [] 2025-12-04T10:01:23.4498533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4498601Z Autotune Choices Stats: 2025-12-04T10:01:23.4500205Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.4500528Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4500772Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4501127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4502408Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4503764Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4505043Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4506346Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4507730Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4509006Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4509291Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.4509358Z Autotune Choices Stats: 2025-12-04T10:01:23.4510997Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.4511555Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4511925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4512561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4513948Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4515308Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4516673Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4517995Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4519318Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4520638Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4521997Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4523360Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4524675Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4526032Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4526366Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.4526502Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4526573Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4526643Z unimplemented [] 2025-12-04T10:01:23.4526749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4526970Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4528370Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4528434Z graph_break [] 2025-12-04T10:01:23.4528574Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4528644Z Autotune Choices Stats: 2025-12-04T10:01:23.4530240Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.4530568Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4530803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4531161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4532484Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4533761Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4535072Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4536382Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4537659Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4538935Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4539268Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.4539335Z Autotune Choices Stats: 2025-12-04T10:01:23.4540971Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.4541487Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4541889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4542529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4543869Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4545231Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4546591Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4547955Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4549282Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4550641Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4551991Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4553317Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4554661Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4556124Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4556422Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.4556557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4556628Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4556696Z unimplemented [] 2025-12-04T10:01:23.4556802Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4557007Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4558400Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4558527Z graph_break [] 2025-12-04T10:01:23.4558665Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4558731Z Autotune Choices Stats: 2025-12-04T10:01:23.4560336Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.4560630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4560873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4561231Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4562566Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4563902Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4565238Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4566512Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4567799Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4569077Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4569396Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.4569462Z Autotune Choices Stats: 2025-12-04T10:01:23.4571100Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.4571662Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4572211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4572966Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4574360Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4575707Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4577034Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4578352Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4579708Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4581030Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4582387Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4583743Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4585055Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4586413Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4586699Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.4586840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4586912Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4586992Z unimplemented [] 2025-12-04T10:01:23.4587103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4587371Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4588767Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4588875Z graph_break [] 2025-12-04T10:01:23.4589018Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4589085Z Autotune Choices Stats: 2025-12-04T10:01:23.4590677Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4591002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4591245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4591611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4592909Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4594230Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4595551Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4596851Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4598140Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4599448Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4599739Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.4599807Z Autotune Choices Stats: 2025-12-04T10:01:23.4601516Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:23.4602041Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4602409Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4603083Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4604453Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4605779Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4607113Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4608469Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4609799Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4611172Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4612492Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4613848Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4615193Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4616522Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4616800Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:23.4616941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4617046Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4617114Z unimplemented [] 2025-12-04T10:01:23.4617220Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4617426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4618813Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4618876Z graph_break [] 2025-12-04T10:01:23.4619013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4619079Z Autotune Choices Stats: 2025-12-04T10:01:23.4620714Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4621007Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4621243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4621600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4622929Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4624250Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4625527Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4626803Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4628172Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4629448Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4629736Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:23.4629841Z 
Autotune Choices Stats: 2025-12-04T10:01:23.4631484Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:23.4632002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4632400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4633076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4634407Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.4635728Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4637061Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4638429Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4639776Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4641097Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4642448Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4643801Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4645118Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:23.4646458Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:23.4646776Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices
2025-12-04T10:01:23.4646977Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:23.4647060Z Traceback (most recent call last):
2025-12-04T10:01:23.4647411Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:23.4647477Z self.assertTrue(
2025-12-04T10:01:23.4647718Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:23.4647807Z raise self.failureException(msg)
2025-12-04T10:01:23.4648083Z AssertionError: False is not true : Log file /tmp/tmpr5b4038i/flex_attention_configs.json was not created
2025-12-04T10:01:23.4648089Z
2025-12-04T10:01:23.4648237Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:23.4648533Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:23.4648537Z
2025-12-04T10:01:23.4648715Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:23.4648895Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:23.4648973Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:23.4649036Z unimplemented []
2025-12-04T10:01:23.4649150Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:23.4650555Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:23.4650770Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:23.4650864Z graph_break []
2025-12-04T10:01:23.4651003Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:23.4652220Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.4652304Z current_size = base.storage().size() 2025-12-04T10:01:23.4652381Z Autotune Choices Stats: 2025-12-04T10:01:23.4653980Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.4654279Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4654521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4654891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4656474Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4657751Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4659108Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4660390Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4661713Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4663051Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4663334Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.4663413Z Autotune Choices Stats: 2025-12-04T10:01:23.4665042Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.4665576Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4665990Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4666637Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4668007Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4669375Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4670704Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4672055Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4673496Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4674822Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4676149Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4677519Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4678866Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4680193Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4680474Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.4680618Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4680722Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4680786Z unimplemented [] 2025-12-04T10:01:23.4680931Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4681135Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4682523Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4682592Z graph_break [] 2025-12-04T10:01:23.4682725Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4682798Z Autotune Choices Stats: 2025-12-04T10:01:23.4684386Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4684690Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4684972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4685329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4686624Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4687932Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4689211Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4690523Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4691824Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4693101Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4693381Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.4693453Z Autotune Choices Stats: 2025-12-04T10:01:23.4695084Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4695644Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4696016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4696659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4698025Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4699354Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4700703Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4702048Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4703363Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4704689Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4706062Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4707424Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4708777Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4710157Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4710478Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.4710614Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4710692Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4710755Z unimplemented [] 2025-12-04T10:01:23.4710865Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4711066Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4712457Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4712527Z graph_break [] 2025-12-04T10:01:23.4712662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4712735Z Autotune Choices Stats: 2025-12-04T10:01:23.4714326Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4714658Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4714897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4715248Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4716543Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4717859Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4719134Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4720442Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4721756Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4723044Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4723325Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:23.4723412Z Autotune Choices Stats: 2025-12-04T10:01:23.4725087Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4725632Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4725993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4726676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4728009Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4729374Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4730723Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4732047Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4733365Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4734712Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4736033Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4737379Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4738702Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4740056Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4740369Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.4740503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4740582Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4740646Z unimplemented [] 2025-12-04T10:01:23.4740758Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4740975Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4742371Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4742439Z graph_break [] 2025-12-04T10:01:23.4742573Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4742646Z Autotune Choices Stats: 2025-12-04T10:01:23.4744282Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4744584Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4744818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4745165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4746519Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4747868Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4749187Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4750493Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4751763Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.4753039Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4753361Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.4753436Z Autotune Choices Stats: 2025-12-04T10:01:23.4755070Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4755730Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4756165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4756811Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4758192Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4759517Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4760878Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4762206Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4763526Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4764897Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4766259Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:23.4767578Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4768933Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4770285Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4770570Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.4770722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4770805Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4770872Z unimplemented [] 
2025-12-04T10:01:23.4770980Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4771192Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4772586Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4772692Z graph_break [] 2025-12-04T10:01:23.4772831Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4772905Z Autotune Choices Stats: 2025-12-04T10:01:23.4774498Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4774793Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4775034Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4775423Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4776721Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4778021Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4779331Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4780611Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4781880Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4783209Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4783491Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.4783573Z Autotune Choices Stats: 2025-12-04T10:01:23.4785216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4785905Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4786271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4786911Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4788332Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4789700Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4791026Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4792356Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4793725Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4795048Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4796404Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4797756Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4799112Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4800428Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4800722Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.4800866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4800947Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4801010Z unimplemented [] 2025-12-04T10:01:23.4801117Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4801328Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4802759Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4802831Z graph_break [] 2025-12-04T10:01:23.4802966Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4803042Z Autotune Choices Stats: 2025-12-04T10:01:23.4804628Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4804960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4805200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4805553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4806884Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4808159Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4809517Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4810795Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4812053Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4813358Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4813640Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.4813717Z Autotune Choices Stats: 2025-12-04T10:01:23.4815434Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4816052Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4816486Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4817285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4818690Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4820032Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4821354Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4822712Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4824029Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4825446Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4827024Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4828433Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4829792Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4831128Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4831422Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.4831556Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4831671Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4831733Z unimplemented [] 2025-12-04T10:01:23.4831839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4832051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4833448Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4833517Z graph_break [] 2025-12-04T10:01:23.4833653Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4833723Z Autotune Choices Stats: 2025-12-04T10:01:23.4835356Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4835646Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4835885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4836238Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4837564Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.4838871Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4840202Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4841510Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4842851Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4844135Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4844451Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.4844531Z Autotune Choices Stats: 2025-12-04T10:01:23.4846180Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.4846742Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4847109Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:23.4847781Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4849117Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4850450Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4851789Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4853156Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4854505Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4857695Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4859161Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4860548Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4861874Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4863181Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4863535Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.4863684Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.4863771Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4863842Z unimplemented [] 2025-12-04T10:01:23.4863953Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4864169Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4865557Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4865637Z graph_break [] 2025-12-04T10:01:23.4865825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4865900Z Autotune Choices Stats: 2025-12-04T10:01:23.4867597Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.4867891Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4868182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4868574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4869874Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4871153Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4872446Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4873770Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4875041Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4876369Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4876657Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.4876733Z Autotune Choices Stats: 2025-12-04T10:01:23.4878413Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4878974Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4879336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4879984Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4881316Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4882651Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4884006Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4885362Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4886686Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4888045Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4889394Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4890709Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4892037Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4893389Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4893680Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.4893818Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4893894Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4893958Z unimplemented [] 2025-12-04T10:01:23.4894068Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4894281Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4895720Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4895791Z graph_break [] 2025-12-04T10:01:23.4895923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4895990Z Autotune Choices Stats: 2025-12-04T10:01:23.4897625Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.4897947Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4898192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4898545Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4899848Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4901113Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4902443Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4903725Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4905039Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4906317Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4906596Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.4906669Z Autotune Choices Stats: 2025-12-04T10:01:23.4908393Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4908953Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4909319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4909963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4911304Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4912668Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4913984Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4915339Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4916688Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4918009Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4919362Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4920681Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4922006Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4923359Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4923661Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.4923804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4923878Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4923943Z unimplemented [] 2025-12-04T10:01:23.4924091Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4924303Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4925693Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4925761Z graph_break [] 2025-12-04T10:01:23.4925895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4925963Z Autotune Choices Stats: 2025-12-04T10:01:23.4927600Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4927917Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4928165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4928522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4929814Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4931087Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4932435Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4933745Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4935029Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4936337Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4936653Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.4936722Z Autotune Choices Stats: 2025-12-04T10:01:23.4938374Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4938901Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4939264Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4939902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4941438Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4942781Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4944152Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4945485Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4946839Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4948252Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4949578Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4950900Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4952259Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4953568Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.4953897Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:23.4954036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4954114Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4954178Z unimplemented [] 2025-12-04T10:01:23.4954284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4954504Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4956087Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4956216Z graph_break [] 2025-12-04T10:01:23.4956356Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4956428Z Autotune Choices Stats: 2025-12-04T10:01:23.4958046Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.4958334Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4958581Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4958933Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4960232Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4961562Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4962843Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4964173Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4965454Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4966816Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4967147Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.4967213Z Autotune Choices Stats: 2025-12-04T10:01:23.4968864Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4969385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4969747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4970432Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4971788Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4973155Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4974483Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4975840Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4977183Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4978506Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4979827Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.4981179Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.4982498Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:23.4983847Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.4984135Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.4984269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.4984343Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.4984408Z unimplemented [] 2025-12-04T10:01:23.4984513Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.4984722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.4986143Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.4986242Z graph_break [] 2025-12-04T10:01:23.4986376Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.4986444Z Autotune Choices Stats: 2025-12-04T10:01:23.4988087Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.4988378Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4988625Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.4988980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.4990325Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4991599Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4992926Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.4994208Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.4995520Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4996830Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.4997109Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.4997176Z Autotune Choices Stats: 2025-12-04T10:01:23.4998812Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.4999385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.4999751Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5000386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5001723Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5003089Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5004462Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5005826Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.5007146Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5008468Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5009820Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5011143Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5012505Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5013825Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5014112Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.5014281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5014392Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5014456Z unimplemented [] 2025-12-04T10:01:23.5014561Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5014768Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5016160Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5016229Z graph_break [] 2025-12-04T10:01:23.5016365Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5016435Z Autotune Choices Stats: 2025-12-04T10:01:23.5018039Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.5018364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5018611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5018964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5020267Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5021585Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5022871Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5024190Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5025510Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5026795Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5027073Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.5027141Z Autotune Choices Stats: 2025-12-04T10:01:23.5028838Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.5029394Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.5029761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5030396Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5031769Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5033102Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5034467Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5035844Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5037177Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5038626Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5039989Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5041387Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5042713Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5044067Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.5044384Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.5044518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5044598Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5044669Z unimplemented [] 2025-12-04T10:01:23.5044773Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5044983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5046380Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5046448Z graph_break [] 2025-12-04T10:01:23.5046579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5046645Z Autotune Choices Stats: 2025-12-04T10:01:23.5048253Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5048586Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5048829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5049182Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5050475Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5051785Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5053110Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5054423Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5055850Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5057140Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5057417Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.5057555Z Autotune Choices Stats: 2025-12-04T10:01:23.5059197Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5059715Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5060081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5060778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5062122Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5063485Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5064848Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5066179Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5067561Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.5068921Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5070233Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5071595Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5072951Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5074273Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5074598Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.5074735Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5074805Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5074876Z unimplemented [] 2025-12-04T10:01:23.5074983Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5075196Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5076600Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5076668Z graph_break [] 2025-12-04T10:01:23.5076859Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.5076926Z Autotune Choices Stats: 2025-12-04T10:01:23.5078531Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5078816Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5079074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5079427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5080772Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5082051Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5083361Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5084670Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5085976Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5087258Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5087579Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.5087649Z Autotune Choices Stats: 2025-12-04T10:01:23.5089287Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.5089806Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5090222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5090860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.5092237Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5093596Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5094919Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5096241Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5097603Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5098920Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5100269Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5101594Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5102949Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5104297Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5104581Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.5104716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5104784Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5104851Z unimplemented [] 2025-12-04T10:01:23.5104954Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5105159Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5106551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5106649Z graph_break [] 2025-12-04T10:01:23.5106790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5106857Z Autotune Choices Stats: 2025-12-04T10:01:23.5108502Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5108791Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5109074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5109427Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5110722Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5112027Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5113359Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5114636Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.5115925Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5117236Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5117524Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.5117591Z Autotune Choices Stats: 2025-12-04T10:01:23.5119307Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5119830Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5120200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5120875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5122213Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5123563Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5124886Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5126211Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5127562Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5128921Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.5130244Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5131595Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5132940Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5134256Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5134541Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.5134675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5134743Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5134812Z unimplemented [] 2025-12-04T10:01:23.5134953Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5135154Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5136551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5136614Z graph_break [] 2025-12-04T10:01:23.5136765Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5136834Z Autotune Choices Stats: 2025-12-04T10:01:23.5138468Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5138756Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5139002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5139353Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5140675Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5141980Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5143267Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5144540Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5145860Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5147140Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5147484Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.5147554Z Autotune Choices Stats: 
2025-12-04T10:01:23.5149238Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5149760Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5150181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5150850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5157344Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.5158740Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5160071Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5161511Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5162852Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5164255Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5165623Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5166986Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5168304Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5169609Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5169955Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.5170110Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5170187Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5170261Z unimplemented [] 2025-12-04T10:01:23.5170375Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5170590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5171995Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5172065Z graph_break [] 2025-12-04T10:01:23.5172222Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5172295Z Autotune Choices Stats: 2025-12-04T10:01:23.5173973Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5174270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5174518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5174904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5176236Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5177516Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5178806Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5180078Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5181400Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.5182701Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5182994Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.5183063Z Autotune Choices Stats: 2025-12-04T10:01:23.5184736Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5185269Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5185670Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5186305Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5187731Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5189056Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5190427Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5191754Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5193119Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5194445Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5195797Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.5197156Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5198478Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5199791Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5200113Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.5200255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5200328Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5200409Z 
unimplemented [] 2025-12-04T10:01:23.5200520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5200729Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5202157Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5202227Z graph_break [] 2025-12-04T10:01:23.5202369Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5202437Z Autotune Choices Stats: 2025-12-04T10:01:23.5204047Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5204375Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5204655Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5205012Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5206298Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5207574Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5208854Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5210162Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5211439Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5212742Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5213030Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.5213101Z Autotune Choices Stats: 2025-12-04T10:01:23.5214784Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.5215334Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5215700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5216338Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5217670Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5219034Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.5220364Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5221722Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5223055Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5224418Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5225777Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5227102Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5228463Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5229849Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5230141Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.5230279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5230353Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5230425Z unimplemented [] 2025-12-04T10:01:23.5230533Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5230738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5232175Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5232242Z graph_break [] 2025-12-04T10:01:23.5232385Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5232455Z Autotune Choices Stats: 2025-12-04T10:01:23.5234076Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5234397Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5234637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5235001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5236291Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5237564Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5238886Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5240158Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5241491Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5242764Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5243089Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.5243205Z Autotune Choices Stats: 2025-12-04T10:01:23.5244850Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5245367Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5245737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5246379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5247724Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5249077Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5250432Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5251747Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5253095Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5254440Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5255916Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5257382Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.5258793Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5260114Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5260406Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.5260598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5260674Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5260745Z unimplemented [] 2025-12-04T10:01:23.5260855Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5261065Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5262463Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5262530Z graph_break [] 2025-12-04T10:01:23.5262718Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5262852Z Autotune Choices Stats: 2025-12-04T10:01:23.5264448Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5264739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5264979Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5265339Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5266621Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5267988Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5269266Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5270575Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5271858Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5273173Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5273496Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.5273565Z Autotune Choices Stats: 2025-12-04T10:01:23.5275851Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.5279243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5280837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5282588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5286039Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5290385Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5294808Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5299245Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5303019Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5305916Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5308739Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5311453Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5314237Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5316987Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5318675Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.5319204Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.5319521Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5319734Z unimplemented [] 2025-12-04T10:01:23.5319946Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5320352Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5322088Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5323653Z graph_break [] 2025-12-04T10:01:23.5323898Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5324196Z Autotune Choices Stats: 2025-12-04T10:01:23.5325938Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5327901Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5328521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5329213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5330955Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5333665Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5336556Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5339198Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5341906Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5344624Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5346278Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.5346717Z Autotune Choices Stats: 2025-12-04T10:01:23.5348561Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.5350804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5351839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5352924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5354995Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5358050Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5360802Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5363576Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5366343Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5369069Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5371799Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5374585Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5377304Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5380087Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5381778Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.5382298Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5382604Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5382813Z unimplemented [] 2025-12-04T10:01:23.5383066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5383478Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5385219Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5386755Z graph_break [] 2025-12-04T10:01:23.5386996Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5387391Z Autotune Choices Stats: 2025-12-04T10:01:23.5389111Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5391067Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5391695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5392424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5394173Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5396827Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.5399489Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5402199Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5404878Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5407869Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5409525Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.5409973Z Autotune Choices Stats: 2025-12-04T10:01:23.5411752Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.5414054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5415025Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5416113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5418232Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5420987Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5423774Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5426547Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5429332Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5432064Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5435035Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5437765Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5440530Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5443396Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5445149Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.5445704Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5446013Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5446224Z unimplemented [] 2025-12-04T10:01:23.5446441Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5446838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5448533Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5450073Z graph_break [] 2025-12-04T10:01:23.5450322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5450625Z Autotune Choices Stats: 2025-12-04T10:01:23.5452358Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5454374Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5454994Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5455879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5457626Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5460342Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5462983Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5465680Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5468439Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5471108Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5472760Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.5473195Z Autotune Choices Stats: 2025-12-04T10:01:23.5474963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5477284Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5478258Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5479346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5481445Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5484224Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5487003Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5489742Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5492468Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5495257Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5497991Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5500760Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5503486Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5506254Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5508032Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.5508547Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5508853Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5509069Z unimplemented [] 2025-12-04T10:01:23.5509298Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5509703Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5511385Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5512914Z graph_break [] 2025-12-04T10:01:23.5513150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5513453Z Autotune Choices Stats: 2025-12-04T10:01:23.5515174Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5517196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5517803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5518496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5520271Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5522923Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5525687Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5528553Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5531196Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5533831Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5535516Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.5535957Z Autotune Choices Stats: 2025-12-04T10:01:23.5537724Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5539978Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5540993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5542086Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5544157Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5546947Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5549781Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5552523Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5555395Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5558226Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5561008Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5563737Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5566531Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5569318Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5571003Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.5571519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5571832Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5572041Z unimplemented [] 2025-12-04T10:01:23.5572256Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5572659Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5574344Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5575934Z graph_break [] 2025-12-04T10:01:23.5576169Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5576480Z Autotune Choices Stats: 2025-12-04T10:01:23.5578204Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.5580166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5580784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5581512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5583254Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5585923Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5588638Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5591278Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5593927Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5596607Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5598248Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.5598690Z Autotune Choices Stats: 2025-12-04T10:01:23.5600454Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.5602745Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5603726Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5604815Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5606922Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5609733Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5612477Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5615211Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5617989Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5620720Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5623475Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5626227Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5629034Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5631755Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5633601Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.5634114Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5634420Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5634628Z unimplemented [] 2025-12-04T10:01:23.5634849Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5635254Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5637002Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5638538Z graph_break [] 2025-12-04T10:01:23.5638778Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5639084Z Autotune Choices Stats: 2025-12-04T10:01:23.5640825Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.5642837Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5643467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5644161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5645974Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5648653Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5651338Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5653976Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5656810Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5659545Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5661199Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.5661635Z Autotune Choices Stats: 2025-12-04T10:01:23.5663450Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.5665697Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5666669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5667922Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5670043Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5672790Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5675540Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5678334Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5681066Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5683833Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5686565Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5689326Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5692096Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5694828Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5696517Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.5697034Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5697383Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5697588Z unimplemented [] 2025-12-04T10:01:23.5697806Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5698217Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5699904Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5701436Z graph_break [] 2025-12-04T10:01:23.5701690Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5701993Z Autotune Choices Stats: 2025-12-04T10:01:23.5703758Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.5705736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5706354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5707052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5708902Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5711573Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5714228Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5716889Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5719570Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5722212Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5723869Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.5724363Z Autotune Choices Stats: 2025-12-04T10:01:23.5726333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.5728723Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5729697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5730817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5732894Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5735736Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5738786Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5741570Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5744331Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5747424Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5750205Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5752983Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5755854Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5758597Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5760374Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.5760884Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5761194Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5761403Z unimplemented [] 2025-12-04T10:01:23.5761613Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5762021Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5763711Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5765255Z graph_break [] 2025-12-04T10:01:23.5765553Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5765855Z Autotune Choices Stats: 2025-12-04T10:01:23.5767576Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.5769545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5770220Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5771053Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5772862Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5775604Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5778558Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5781262Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5783920Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5786619Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5788337Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.5788775Z Autotune Choices Stats: 2025-12-04T10:01:23.5790581Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.5792854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5793827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5794931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5797007Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5799759Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5802552Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5805294Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5808068Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5810846Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5813616Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5816354Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5819078Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5821856Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5823548Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.5824059Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5824356Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5824563Z unimplemented [] 2025-12-04T10:01:23.5824780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5825182Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5826917Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5828543Z graph_break [] 2025-12-04T10:01:23.5828780Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5829084Z Autotune Choices Stats: 2025-12-04T10:01:23.5830845Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5832873Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5833492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5834184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5835946Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5838608Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5841283Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5843971Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5846667Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5849319Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5850969Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.5851407Z Autotune Choices Stats: 2025-12-04T10:01:23.5853204Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:23.5855636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5856612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5857715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5859788Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5862608Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5865358Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5868234Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5870984Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5873843Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5876655Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5879398Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5882128Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5884915Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5886621Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:23.5887139Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5887442Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5887655Z unimplemented [] 2025-12-04T10:01:23.5887870Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5888317Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5890007Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5891540Z graph_break [] 2025-12-04T10:01:23.5891769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5892071Z Autotune Choices Stats: 2025-12-04T10:01:23.5893837Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5895834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5896452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5897134Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5898878Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5901525Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5904227Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5906870Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5909609Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.5912291Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5913972Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:23.5914410Z 
Autotune Choices Stats: 2025-12-04T10:01:23.5916167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:23.5918412Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5919377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5920470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5922589Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.5925335Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5928093Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5931008Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5934586Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5938192Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5941078Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5943826Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5946614Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5949442Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.5951189Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:23.5951710Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.5952021Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.5952232Z unimplemented [] 2025-12-04T10:01:23.5952444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.5952849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.5954605Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.5956354Z graph_break [] 2025-12-04T10:01:23.5956602Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.5956903Z Autotune Choices Stats: 2025-12-04T10:01:23.5958633Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.5960605Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5961228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5961914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5963660Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5966438Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5969094Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.5971797Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5974445Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.5977161Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5978861Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:23.5979300Z Autotune Choices Stats: 2025-12-04T10:01:23.5981071Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.5983317Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.5984290Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.5985445Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.5987564Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.5990347Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.5993112Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.5995900Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.5998684Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6001415Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6004154Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.6006950Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6009679Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6012446Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6014142Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:23.6014721Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.6015089Z Traceback (most recent call last): 2025-12-04T10:01:23.6015619Z File 
"/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.6016129Z self.assertTrue( 2025-12-04T10:01:23.6016513Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.6016910Z raise self.failureException(msg) 2025-12-04T10:01:23.6017388Z AssertionError: False is not true : Log file /tmp/tmpx8acg7t9/flex_attention_configs.json was not created 2025-12-04T10:01:23.6017761Z 2025-12-04T10:01:23.6017916Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.6018440Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.6018828Z 2025-12-04T10:01:23.6019003Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.6019413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6019729Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6019936Z unimplemented [] 2025-12-04T10:01:23.6020160Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6021763Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.6023452Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6023814Z graph_break [] 2025-12-04T10:01:23.6024049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6025484Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.6026810Z current_size = base.storage().size() 2025-12-04T10:01:23.6027044Z Autotune Choices Stats: 2025-12-04T10:01:23.6028837Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.6030797Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6031478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6032168Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6033911Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6036601Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6039258Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6042554Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6046001Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6048753Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6050409Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.6050851Z Autotune Choices Stats: 2025-12-04T10:01:23.6052663Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.6054914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6056113Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6057217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6059359Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6062172Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6064907Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6067725Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6070519Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6073298Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6076036Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6078823Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6081584Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6084312Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6085999Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 
2025-12-04T10:01:23.6086518Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6086834Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6087042Z unimplemented [] 2025-12-04T10:01:23.6087260Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6087713Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6089415Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6090953Z graph_break [] 2025-12-04T10:01:23.6091197Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6091500Z Autotune Choices Stats: 2025-12-04T10:01:23.6093255Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6095222Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6095837Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6096527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6098319Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6100994Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6103643Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6106274Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6109013Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6111644Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6113295Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.6113738Z Autotune Choices Stats: 2025-12-04T10:01:23.6115536Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6117781Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6118793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6119923Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6122005Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6124754Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6127477Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6130245Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6132967Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6135729Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6138505Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6141247Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6144023Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:23.6146929Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6148709Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.6149289Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6149597Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6149809Z unimplemented [] 2025-12-04T10:01:23.6150025Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6150437Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6152120Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6153657Z graph_break [] 2025-12-04T10:01:23.6153797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6153872Z Autotune Choices Stats: 2025-12-04T10:01:23.6155726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6156038Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6156291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6156698Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6158010Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6159341Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6160636Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6161919Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6163249Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6164543Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6164870Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.6164951Z Autotune Choices Stats: 2025-12-04T10:01:23.6166587Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6167159Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6167578Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6168229Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6169560Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6170907Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6172279Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6173606Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.6174988Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6176314Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6177675Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6179031Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6180354Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6181677Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6182010Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.6182158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6182236Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6182301Z unimplemented [] 2025-12-04T10:01:23.6182414Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6182626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6184022Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6184130Z graph_break [] 2025-12-04T10:01:23.6184268Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6184343Z Autotune Choices Stats: 2025-12-04T10:01:23.6185942Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6186270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6186520Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6186913Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6188283Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6189564Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6190847Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6192176Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6193447Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6194759Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6195042Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.6195117Z Autotune Choices Stats: 2025-12-04T10:01:23.6196805Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6197365Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6197730Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6198383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6199720Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6201047Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6202409Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6203815Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6205139Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6206517Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6207877Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6209191Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6210510Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6211867Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6212169Z SingleProcess AUTOTUNE 
benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.6212308Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6212387Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6212452Z unimplemented [] 2025-12-04T10:01:23.6212561Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6212772Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6214199Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6214270Z graph_break [] 2025-12-04T10:01:23.6214404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6214473Z Autotune Choices Stats: 2025-12-04T10:01:23.6216110Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6216441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6216684Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6217046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6218347Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6219625Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6220943Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6222234Z triton_flex_attention_79 
0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6223536Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6224821Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6225101Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.6225177Z Autotune Choices Stats: 2025-12-04T10:01:23.6226863Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6227503Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6227875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6228528Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6229864Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6231240Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6232562Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6233927Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6235277Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6236639Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6237961Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6239277Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6240598Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6241953Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6242244Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.6242383Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6242463Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6242529Z unimplemented [] 2025-12-04T10:01:23.6242688Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6242902Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6244290Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6244365Z graph_break [] 2025-12-04T10:01:23.6244499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6244602Z Autotune Choices Stats: 2025-12-04T10:01:23.6246375Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": 
"triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6246720Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6246974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6247334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6248639Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6249909Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6251235Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6252553Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6253827Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6255144Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6255609Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.6255687Z Autotune Choices Stats: 2025-12-04T10:01:23.6257333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6257874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6258239Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6258891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6260311Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6261644Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6263036Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6264372Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6265741Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6267115Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6268527Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6269855Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6271226Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6273004Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6273306Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.6273447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6273524Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6273589Z unimplemented [] 2025-12-04T10:01:23.6273697Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6273911Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6275334Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6275441Z graph_break [] 2025-12-04T10:01:23.6275588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6275667Z Autotune Choices Stats: 2025-12-04T10:01:23.6277268Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6277554Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6277805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6278165Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6279459Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6280795Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6282079Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6283398Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6284710Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6286010Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6286323Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.6286401Z Autotune Choices Stats: 2025-12-04T10:01:23.6288053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6288579Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.6288942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6289620Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6290972Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6292338Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6293667Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6295037Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6296399Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6297734Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6299056Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6300418Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6307550Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6309054Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.6309372Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.6309523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6309609Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6309677Z unimplemented [] 2025-12-04T10:01:23.6309789Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6310051Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6311450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6311555Z graph_break [] 2025-12-04T10:01:23.6311701Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6311776Z Autotune Choices Stats: 2025-12-04T10:01:23.6313395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.6313688Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6313937Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6314331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6315639Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6316916Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6318240Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6319508Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6320813Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6322147Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6322434Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.6322506Z Autotune Choices Stats: 2025-12-04T10:01:23.6324153Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6324706Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6325077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6325728Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6327067Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6328417Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6329769Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6331125Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6332458Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.6333782Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6335130Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6336454Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6337806Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6339119Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6339439Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.6339583Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6339690Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6339762Z unimplemented [] 2025-12-04T10:01:23.6339871Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6340080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6341466Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6341535Z graph_break [] 2025-12-04T10:01:23.6341672Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.6341743Z Autotune Choices Stats: 2025-12-04T10:01:23.6343339Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.6343664Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6343911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6344280Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6345574Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6346888Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6348214Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6349525Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6351038Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6352327Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6352611Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.6352682Z Autotune Choices Stats: 2025-12-04T10:01:23.6354330Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6354914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6355722Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6356371Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.6357797Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6359119Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6360513Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6361876Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6363196Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6364519Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6365937Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6367287Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6368609Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6369960Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6370283Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.6370425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6370496Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6370572Z unimplemented [] 2025-12-04T10:01:23.6370681Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6370897Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6372287Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6372359Z graph_break [] 2025-12-04T10:01:23.6372505Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6372577Z Autotune Choices Stats: 2025-12-04T10:01:23.6374190Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6374515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6374762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6375118Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6376449Z triton_flex_attention_175 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6377720Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6379035Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6380346Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.6381628Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6382904Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6383227Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.6383299Z Autotune Choices Stats: 2025-12-04T10:01:23.6384942Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6385462Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6385843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6386513Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6387953Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6389319Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6390675Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6392011Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6393325Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6394678Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.6396004Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6397367Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6398721Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6400076Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6400368Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.6400509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6400580Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6400651Z unimplemented [] 2025-12-04T10:01:23.6400756Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6400966Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6402352Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6402455Z graph_break [] 2025-12-04T10:01:23.6402594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6402661Z Autotune Choices Stats: 2025-12-04T10:01:23.6404254Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.6404542Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6404789Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6405144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6406473Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6407780Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6409075Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6410380Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6411662Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6412937Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6413255Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.6413323Z Autotune Choices Stats: 
2025-12-04T10:01:23.6414960Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6415511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6415879Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6416517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6417879Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.6419234Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6420577Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6421891Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6423241Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6424561Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6425911Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6427293Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6428667Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6430018Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6430307Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.6430447Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6430518Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6430588Z unimplemented [] 2025-12-04T10:01:23.6430693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6430898Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6432294Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6432413Z graph_break [] 2025-12-04T10:01:23.6432555Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6432623Z Autotune Choices Stats: 2025-12-04T10:01:23.6434227Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6434515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6434796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6435156Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6436453Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6437759Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6439072Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6440339Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6441651Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.6442968Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6443263Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.6443332Z Autotune Choices Stats: 2025-12-04T10:01:23.6445010Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6445534Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6445910Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6446579Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6447949Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6449278Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6450784Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6452128Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6453494Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6454852Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6456335Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.6457725Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6459099Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6460428Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6460721Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.6460862Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6460934Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6461058Z 
unimplemented [] 2025-12-04T10:01:23.6461165Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6461387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6462797Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6462862Z graph_break [] 2025-12-04T10:01:23.6463006Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6463075Z Autotune Choices Stats: 2025-12-04T10:01:23.6464733Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.6465028Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6465276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6465631Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6466967Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6468373Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6469650Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6470931Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6472252Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6473526Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6473811Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.6473882Z Autotune Choices Stats: 2025-12-04T10:01:23.6475557Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6476078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6476475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6477143Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6478487Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6479808Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.6481136Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6482509Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6483858Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6485181Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6486555Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6487907Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6489235Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6490561Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6490889Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.6491026Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6491096Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6491172Z unimplemented [] 2025-12-04T10:01:23.6491280Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6491486Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6492887Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6492951Z graph_break [] 2025-12-04T10:01:23.6493093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6493165Z Autotune Choices Stats: 2025-12-04T10:01:23.6494800Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6495089Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6495336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6495727Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6497046Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6498333Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6499616Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6500929Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6502210Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6503520Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6503807Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.6503873Z Autotune Choices Stats: 2025-12-04T10:01:23.6505578Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6506133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6506512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6507152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6508553Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6509882Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6511248Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6512573Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6513917Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6515269Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6516587Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6517993Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.6519314Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6520649Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6520973Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.6521108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6521180Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6521253Z unimplemented [] 2025-12-04T10:01:23.6521359Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6521565Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6523005Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6523072Z graph_break [] 2025-12-04T10:01:23.6523216Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6523284Z Autotune Choices Stats: 2025-12-04T10:01:23.6524924Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6525221Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6525505Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6525860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6527154Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6528447Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6529734Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6531053Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6532334Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6533642Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6533942Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.6534014Z Autotune Choices Stats: 2025-12-04T10:01:23.6535700Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6536249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6536612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6537261Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6538604Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6539963Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6541289Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6542662Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6543996Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6545359Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6546712Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6548084Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6549401Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6550753Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6551047Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.6551187Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.6551258Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6551327Z unimplemented [] 2025-12-04T10:01:23.6551431Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6551635Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6553062Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6553126Z graph_break [] 2025-12-04T10:01:23.6553264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6553332Z Autotune Choices Stats: 2025-12-04T10:01:23.6554955Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6555441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6555687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6556045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6557340Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6558610Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6559964Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6561243Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6562569Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6563856Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6564190Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.6564303Z Autotune Choices Stats: 2025-12-04T10:01:23.6565956Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6566477Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6566844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6567487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6568834Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6570200Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6571556Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6572882Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6574245Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6575606Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6576929Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6578250Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6579622Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6580944Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6581226Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.6581403Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6581475Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6581557Z unimplemented [] 2025-12-04T10:01:23.6581667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6581871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6583268Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6583368Z graph_break [] 2025-12-04T10:01:23.6583513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6583615Z Autotune Choices Stats: 2025-12-04T10:01:23.6585211Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6585500Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6585743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6586107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6587430Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6588761Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.6590049Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6591368Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6592652Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6593958Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6594280Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.6594350Z Autotune Choices Stats: 2025-12-04T10:01:23.6596001Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6596528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6596897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6597570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6598918Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6600244Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6601604Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6605921Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6607290Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6608610Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6609934Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6611293Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6612611Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6613965Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6614256Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.6614392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6614461Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6614530Z unimplemented [] 2025-12-04T10:01:23.6614645Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6614857Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6616293Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6616388Z graph_break [] 2025-12-04T10:01:23.6616527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6616593Z Autotune Choices Stats: 2025-12-04T10:01:23.6618205Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6618496Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6618733Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6619089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6620377Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6621692Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6623046Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6624325Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6625639Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6626956Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6627297Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.6627368Z Autotune Choices Stats: 2025-12-04T10:01:23.6629017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6629537Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6629941Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6630576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6631919Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6633276Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6634612Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6635965Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6637336Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6638665Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6639989Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6641373Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6642714Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6644050Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6644329Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.6644472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6644541Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6644648Z unimplemented [] 2025-12-04T10:01:23.6644756Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6644992Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6646391Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6646454Z graph_break [] 2025-12-04T10:01:23.6646594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6646662Z Autotune Choices Stats: 2025-12-04T10:01:23.6648259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6648546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6648827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6649198Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6650481Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6651755Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6653069Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6654377Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6655870Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6657154Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6657437Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.6657505Z Autotune Choices Stats: 2025-12-04T10:01:23.6659151Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6659749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6660121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6660760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6662149Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6663473Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6664855Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6666236Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6667671Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6669000Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6670363Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6671687Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6673041Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6674398Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6674694Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.6674888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6674958Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6675027Z unimplemented [] 2025-12-04T10:01:23.6675131Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6675335Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6676741Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6676805Z graph_break [] 2025-12-04T10:01:23.6676943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6677009Z Autotune Choices Stats: 2025-12-04T10:01:23.6678610Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6678951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6679192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6679554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6680841Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6682155Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6683431Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6684736Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6686047Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6687314Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6687596Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.6687662Z Autotune Choices Stats: 2025-12-04T10:01:23.6689347Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6689865Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6690233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6690912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6692256Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6693608Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6694963Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6696302Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6697621Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6698988Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6700302Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6701662Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6702982Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6704339Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6704652Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.6704791Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6704861Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6704932Z unimplemented [] 2025-12-04T10:01:23.6705038Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6705240Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6706633Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6706696Z graph_break [] 2025-12-04T10:01:23.6706833Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6706899Z Autotune Choices Stats: 2025-12-04T10:01:23.6708573Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6708865Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6709114Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6709472Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6710797Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6712071Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6713394Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6714704Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6715986Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6717255Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6717572Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.6717639Z Autotune Choices Stats: 2025-12-04T10:01:23.6719281Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6719800Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6720201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6720840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6722211Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6723538Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6724894Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6726210Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6727538Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6728895Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6730246Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6731567Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6732911Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6734266Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6734545Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.6734687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6734757Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6734840Z unimplemented [] 2025-12-04T10:01:23.6734948Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6735150Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6736544Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6736645Z graph_break [] 2025-12-04T10:01:23.6736787Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6736853Z Autotune Choices Stats: 2025-12-04T10:01:23.6738443Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6738741Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6738986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6739384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6740681Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6741985Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6743294Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6744576Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6745858Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6747166Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6747501Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.6747571Z Autotune Choices Stats: 2025-12-04T10:01:23.6749280Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.6749810Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6750176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6750810Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6752195Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6753546Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6754873Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6756366Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6757759Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6759083Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6760461Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6761833Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6763197Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6764516Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6764794Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.6764935Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6765014Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6765078Z unimplemented [] 2025-12-04T10:01:23.6765188Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6765389Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6766828Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6766892Z graph_break [] 2025-12-04T10:01:23.6767031Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6767099Z Autotune Choices Stats: 2025-12-04T10:01:23.6768699Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6769027Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6769265Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6769622Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6770943Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6772259Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6773534Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6774812Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6776092Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6777446Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6777729Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.6777799Z Autotune Choices Stats: 2025-12-04T10:01:23.6779481Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.6780002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6780370Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6781036Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6782397Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6783730Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6785056Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6786422Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6787817Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6789370Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6790705Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6792070Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6793419Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6794746Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6795027Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.6795207Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6795278Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6795342Z unimplemented [] 2025-12-04T10:01:23.6795452Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6795657Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6797059Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6797131Z graph_break [] 2025-12-04T10:01:23.6797266Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6797343Z Autotune Choices Stats: 2025-12-04T10:01:23.6798998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6799294Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6799535Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6799900Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6801227Z 
triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6802540Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6803825Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6805112Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6806440Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6807723Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6808049Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.6808121Z Autotune Choices Stats: 2025-12-04T10:01:23.6809772Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6810334Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6810732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6811386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6812729Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6814056Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6815420Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6816740Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6818108Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6819427Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.6820786Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6822158Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6823483Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6824812Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6825128Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.6825270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6825338Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6825400Z unimplemented [] 2025-12-04T10:01:23.6825508Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6825720Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6827119Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6827182Z graph_break [] 2025-12-04T10:01:23.6827413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6827491Z Autotune Choices Stats: 2025-12-04T10:01:23.6829095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6829388Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6829656Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6830048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6831343Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6832626Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6833913Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6835232Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6836512Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6837834Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6838119Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.6838189Z Autotune Choices Stats: 
2025-12-04T10:01:23.6839871Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6840429Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6840787Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6841448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6842783Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.6844118Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6845485Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6846835Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6848167Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6849517Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6850885Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6852221Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6853542Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6855051Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6855491Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.6855637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6855706Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6855768Z unimplemented [] 2025-12-04T10:01:23.6855879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6856096Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6857586Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6857661Z graph_break [] 2025-12-04T10:01:23.6857797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6857868Z Autotune Choices Stats: 2025-12-04T10:01:23.6859540Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.6859878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6860115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6860477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6861782Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6863070Z triton_flex_attention_480 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6864404Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6865698Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6867021Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.6868395Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6868690Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.6868762Z Autotune Choices Stats: 2025-12-04T10:01:23.6870449Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.6871010Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6871380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6872037Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6873373Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6874750Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6876091Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6877454Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6878812Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6880134Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6881494Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.6882831Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6884152Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6885517Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.6885799Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.6885940Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6886011Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6886073Z 
unimplemented [] 2025-12-04T10:01:23.6886219Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6886426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6887819Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6887881Z graph_break [] 2025-12-04T10:01:23.6888013Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6888088Z Autotune Choices Stats: 2025-12-04T10:01:23.6889721Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.6890048Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6890291Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6890661Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6891950Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6893232Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6894557Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6895891Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6897178Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6898494Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6898810Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.6898882Z Autotune Choices Stats: 2025-12-04T10:01:23.6900524Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.6901048Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6901408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6902055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6903422Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6904771Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.6906140Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6907520Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6908886Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6910236Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6911567Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6912896Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6914252Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6915582Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6915898Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.6916040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6916109Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6916172Z unimplemented [] 2025-12-04T10:01:23.6916282Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6916485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6917913Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6918010Z graph_break [] 2025-12-04T10:01:23.6918147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6918218Z Autotune Choices Stats: 2025-12-04T10:01:23.6919820Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.6920114Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6920352Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6920725Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6922019Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6923333Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6924619Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6925944Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6927258Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6928551Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6928908Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.6928980Z Autotune Choices Stats: 2025-12-04T10:01:23.6930633Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.6931157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6931514Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6932221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6933568Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6934936Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6936288Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6937641Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6939001Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6940337Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6941667Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6943025Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.6944356Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6945729Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6946013Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.6946156Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.6946227Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6946289Z unimplemented [] 2025-12-04T10:01:23.6946400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6946637Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6948063Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6948170Z graph_break [] 2025-12-04T10:01:23.6948305Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6948379Z Autotune Choices Stats: 2025-12-04T10:01:23.6949992Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.6950287Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6950527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6950891Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6952217Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6953493Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6954804Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6956375Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6957746Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6959077Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6959358Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.6959430Z Autotune Choices Stats: 2025-12-04T10:01:23.6961077Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.6961667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6962029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6962674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6964011Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6965392Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6966759Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6968143Z 
triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6969472Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6970798Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6972152Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6973477Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6974832Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6976170Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6976451Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.6976631Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.6976735Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.6976798Z unimplemented [] 2025-12-04T10:01:23.6976906Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.6977111Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.6978504Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.6978574Z graph_break [] 2025-12-04T10:01:23.6978709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.6978784Z Autotune Choices Stats: 2025-12-04T10:01:23.6980388Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.6980716Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6980957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6981315Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6982609Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6983934Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6985216Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.6986552Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6987941Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.6989227Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6989504Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.6989578Z Autotune Choices Stats: 2025-12-04T10:01:23.6991222Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.6991795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.6992157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.6992806Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.6994173Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.6995506Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.6996863Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.6998230Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.6999563Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7000890Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7002252Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7003600Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7004927Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7006304Z 
triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7006621Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:23.7006759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7006835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7006896Z unimplemented [] 2025-12-04T10:01:23.7007010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7007217Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7008620Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7008692Z graph_break [] 2025-12-04T10:01:23.7008826Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7008900Z Autotune Choices Stats: 2025-12-04T10:01:23.7010505Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7010837Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7011084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7011440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7012772Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7014061Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.7015371Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7016686Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7017972Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7019263Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7019543Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:23.7019652Z Autotune Choices Stats: 2025-12-04T10:01:23.7021289Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:23.7021811Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7022176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7022861Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7024190Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7025553Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7026911Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7028301Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7029625Z 
triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7030982Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7032313Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7033666Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7035029Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7036380Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7036658Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:23.7036794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7036868Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7036929Z unimplemented [] 2025-12-04T10:01:23.7037041Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7037243Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7038634Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7038700Z graph_break [] 2025-12-04T10:01:23.7038870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7038946Z Autotune Choices Stats: 2025-12-04T10:01:23.7040551Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7040843Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7041081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7041440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7042789Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7044109Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7045390Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7046708Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7047982Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7049272Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7049585Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:23.7049659Z Autotune Choices Stats: 2025-12-04T10:01:23.7051296Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.7051847Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7052206Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7052866Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7054256Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7055800Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7057126Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7058458Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7059842Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7061164Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7062537Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7063862Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7065242Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7066635Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7066919Z SingleProcess AUTOTUNE 
benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:23.7067054Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7067130Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7067192Z unimplemented [] 2025-12-04T10:01:23.7067350Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7067557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7068947Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7069055Z graph_break [] 2025-12-04T10:01:23.7069193Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7069278Z Autotune Choices Stats: 2025-12-04T10:01:23.7070891Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.7071182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7071458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7071821Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7073120Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7074437Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7075762Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7077046Z 
triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7078333Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7079670Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7079953Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:23.7080026Z Autotune Choices Stats: 2025-12-04T10:01:23.7081868Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.7082410Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7082773Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7083504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7084846Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7086215Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7087549Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7088877Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7090250Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7091605Z triton_flex_attention_backward_620 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7092940Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7094299Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7095663Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:23.7097000Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:23.7097283Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices
2025-12-04T10:01:23.7097477Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:23.7097564Z Traceback (most recent call last):
2025-12-04T10:01:23.7097915Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:23.7098025Z self.assertTrue(
2025-12-04T10:01:23.7098262Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:23.7098350Z raise self.failureException(msg)
2025-12-04T10:01:23.7098637Z AssertionError: False is not true : Log file /tmp/tmpbzs82aeu/flex_attention_configs.json was not created
2025-12-04T10:01:23.7098644Z
2025-12-04T10:01:23.7098786Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:23.7099078Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:23.7099088Z
2025-12-04T10:01:23.7099263Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:23.7099405Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:23.7099481Z frames [('total', 
2), ('ok', 2)] 2025-12-04T10:01:23.7099556Z unimplemented [] 2025-12-04T10:01:23.7099668Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7101112Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.7101321Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7101385Z graph_break [] 2025-12-04T10:01:23.7101517Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7102708Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.7102799Z current_size = base.storage().size() 2025-12-04T10:01:23.7102903Z Autotune Choices Stats: 2025-12-04T10:01:23.7104517Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.7104806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7105048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7105407Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7106702Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7108085Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7109367Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7110676Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7111958Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7113272Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7113606Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.7113673Z Autotune Choices Stats: 2025-12-04T10:01:23.7115322Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.7115844Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7116207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7116849Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7118222Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7119556Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7120921Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7122268Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7123591Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7124953Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7126273Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7127598Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7128954Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7130303Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7130591Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.7130726Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7130796Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7130873Z unimplemented [] 2025-12-04T10:01:23.7130984Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7131193Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7132627Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7132729Z graph_break [] 2025-12-04T10:01:23.7132862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7132928Z Autotune Choices Stats: 2025-12-04T10:01:23.7134539Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7134826Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7135075Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7135431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7136728Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7138040Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7139363Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7140641Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7141951Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7143266Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7143551Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.7143621Z Autotune Choices Stats: 2025-12-04T10:01:23.7145266Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7145786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7146191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7146838Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7148236Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7149607Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7150930Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7152318Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7153670Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7154995Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7156478Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7157873Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7159199Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7160558Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7160850Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.7160988Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7161061Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7161133Z unimplemented [] 2025-12-04T10:01:23.7161284Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7161500Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7162929Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7162991Z graph_break [] 2025-12-04T10:01:23.7163133Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7163200Z Autotune Choices Stats: 2025-12-04T10:01:23.7164804Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7165091Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7165335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7165742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7167034Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7168309Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7169622Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7170939Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7172219Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7173528Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7173813Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:23.7173882Z Autotune Choices Stats: 2025-12-04T10:01:23.7175524Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7176084Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7176452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7177093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7178471Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7179803Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7181170Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7182562Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7183892Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7185219Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7186598Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7187962Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7189340Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7190656Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7190977Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.7191146Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7191217Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7191286Z unimplemented [] 2025-12-04T10:01:23.7191391Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7191593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7192985Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7193058Z graph_break [] 2025-12-04T10:01:23.7193202Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7193271Z Autotune Choices Stats: 2025-12-04T10:01:23.7194869Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7195190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7195439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7195795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7197081Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7198402Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7199688Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7200999Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7202324Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.7203599Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7203886Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.7203955Z Autotune Choices Stats: 2025-12-04T10:01:23.7205682Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7206346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7206782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7207542Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7208921Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7210283Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7211647Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7212990Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7214308Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7215635Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7216991Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:23.7218348Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7219675Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7221026Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7221344Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.7221478Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7221546Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7221617Z unimplemented [] 
2025-12-04T10:01:23.7221722Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7221925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7223319Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7223387Z graph_break [] 2025-12-04T10:01:23.7223536Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7223606Z Autotune Choices Stats: 2025-12-04T10:01:23.7225198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7225541Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7225784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7226137Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7227532Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7228827Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7230163Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7231473Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7232759Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7234036Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7234355Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.7234422Z Autotune Choices Stats: 2025-12-04T10:01:23.7236069Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7236588Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7237036Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7237683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7239012Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7240374Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7241731Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7243071Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7244383Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7245744Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7247057Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7248435Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7249781Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7251133Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7251417Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.7251548Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7251620Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7251687Z unimplemented [] 2025-12-04T10:01:23.7251792Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7251999Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7253388Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7253494Z graph_break [] 2025-12-04T10:01:23.7253633Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7253699Z Autotune Choices Stats: 2025-12-04T10:01:23.7255552Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7255852Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7256105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7256464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7257828Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7259163Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7260518Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7261796Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7263076Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7264352Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7264689Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.7264756Z Autotune Choices Stats: 2025-12-04T10:01:23.7266417Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7266979Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7267398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7268041Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7269427Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7270785Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7272119Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7273445Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7274812Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7276151Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7277512Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7278840Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7280193Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7281551Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7281838Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.7281975Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7282045Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7282114Z unimplemented [] 2025-12-04T10:01:23.7282219Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7282423Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7283822Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7283920Z graph_break [] 2025-12-04T10:01:23.7284059Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7284130Z Autotune Choices Stats: 2025-12-04T10:01:23.7285735Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7286027Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7286309Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7286676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7287980Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.7289307Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7290620Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7291905Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7293189Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7294502Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7294790Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.7294857Z Autotune Choices Stats: 2025-12-04T10:01:23.7296753Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7297278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7297647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:23.7298344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7299718Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7301044Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7302383Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7303743Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7305076Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7306458Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7307861Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7309229Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7310582Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7311915Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7312206Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.7312341Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.7312447Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7312519Z unimplemented [] 2025-12-04T10:01:23.7312625Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7312830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7314218Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7314280Z graph_break [] 2025-12-04T10:01:23.7314422Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7314490Z Autotune Choices Stats: 2025-12-04T10:01:23.7316142Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.7316436Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7316676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7317047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7318375Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7319693Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7320985Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7322265Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7323589Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7324863Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7325147Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.7325213Z Autotune Choices Stats: 2025-12-04T10:01:23.7327227Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7331385Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7331814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7332456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7333800Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7335157Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7336480Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7337792Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7339165Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7340479Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7341818Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7343262Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7344573Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7345894Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7346185Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.7346337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7346412Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7346481Z unimplemented [] 2025-12-04T10:01:23.7346590Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7346793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7348262Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7348332Z graph_break [] 2025-12-04T10:01:23.7348475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7348544Z Autotune Choices Stats: 2025-12-04T10:01:23.7350188Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.7350488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7350852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7351217Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7352502Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7353768Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7355037Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7356606Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7357886Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7359240Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7359534Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.7359604Z Autotune Choices Stats: 2025-12-04T10:01:23.7361285Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7361913Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7362280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7362912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7364239Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7365547Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7366875Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7368177Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7369533Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7370889Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7372284Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7373601Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7374911Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7376223Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7376505Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.7376662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7376734Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7376804Z unimplemented [] 2025-12-04T10:01:23.7376912Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7377116Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7378551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7378615Z graph_break [] 2025-12-04T10:01:23.7378757Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7378823Z Autotune Choices Stats: 2025-12-04T10:01:23.7380441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7380803Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7381040Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7381397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7382681Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7383952Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7385220Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7386732Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7388223Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7389500Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7389784Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.7389923Z Autotune Choices Stats: 2025-12-04T10:01:23.7391646Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7392171Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7392539Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7393175Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7394507Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7395945Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7397439Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7398785Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7400099Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7401514Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7402819Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7404135Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7405440Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7406762Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7407041Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:23.7407194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7407270Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7407332Z unimplemented [] 2025-12-04T10:01:23.7407447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7407687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7409079Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7409140Z graph_break [] 2025-12-04T10:01:23.7409278Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7409434Z Autotune Choices Stats: 2025-12-04T10:01:23.7411050Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.7411341Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7411577Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7411940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7413220Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7414483Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7415759Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7417047Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7418356Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7419655Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7420009Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.7420075Z Autotune Choices Stats: 2025-12-04T10:01:23.7421699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7422218Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7422586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7423213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7424535Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7425855Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7427295Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7428615Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7430043Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7431358Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7432671Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7433980Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7435290Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:23.7436594Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7436876Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.7442597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7442715Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7442793Z unimplemented [] 2025-12-04T10:01:23.7442917Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7443133Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7444581Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7444725Z graph_break [] 2025-12-04T10:01:23.7444878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7444950Z Autotune Choices Stats: 2025-12-04T10:01:23.7446568Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7446861Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7447115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7447467Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7448760Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7450037Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7451309Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7452634Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7453913Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7455588Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7455916Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.7455989Z Autotune Choices Stats: 2025-12-04T10:01:23.7457788Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7458320Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7458690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7459334Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7460681Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7462002Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7463401Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7464765Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.7466196Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7467581Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7468895Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7470220Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7471525Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7472885Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7473179Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.7473321Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7473394Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7473465Z unimplemented [] 2025-12-04T10:01:23.7473574Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7473857Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7475286Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7475351Z graph_break [] 2025-12-04T10:01:23.7475495Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7475563Z Autotune Choices Stats: 2025-12-04T10:01:23.7477166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.7477466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7477709Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7478065Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7479345Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7480612Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7481924Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7483195Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7484571Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7485847Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7486131Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.7486198Z Autotune Choices Stats: 2025-12-04T10:01:23.7487834Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7488350Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.7488719Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7489350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7490677Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7492059Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7493368Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7494782Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7496093Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7497410Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7498715Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7500036Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7501440Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7502767Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.7503055Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.7503290Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7503393Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7503464Z unimplemented [] 2025-12-04T10:01:23.7503573Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7503778Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7505185Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7505251Z graph_break [] 2025-12-04T10:01:23.7505395Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7505462Z Autotune Choices Stats: 2025-12-04T10:01:23.7507063Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7507406Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7507662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7508012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7509294Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7510610Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7511884Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7513187Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7514524Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7515790Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7516088Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.7516156Z Autotune Choices Stats: 2025-12-04T10:01:23.7517787Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7518310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7518677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7519313Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7520685Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7522011Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7523436Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7524753Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7526076Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.7527392Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7528706Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7530040Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7531391Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7532761Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7533110Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.7533246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7533315Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7533383Z unimplemented [] 2025-12-04T10:01:23.7533487Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7533691Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7535080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7535144Z graph_break [] 2025-12-04T10:01:23.7535286Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.7535353Z Autotune Choices Stats: 2025-12-04T10:01:23.7537247Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7537591Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7537861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7538215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7539498Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7540803Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7542105Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7543438Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7544714Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7545989Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7546270Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.7546337Z Autotune Choices Stats: 2025-12-04T10:01:23.7548008Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7548530Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7548894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7549573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.7550898Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7552256Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7553684Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7555002Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7556510Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7557837Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7559148Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7560550Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7561859Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7563330Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7563627Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.7563770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7563844Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7563919Z unimplemented [] 2025-12-04T10:01:23.7564030Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7564238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7565636Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7565699Z graph_break [] 2025-12-04T10:01:23.7565839Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7565909Z Autotune Choices Stats: 2025-12-04T10:01:23.7567512Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7567800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7568039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7568399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7569731Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7571002Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7572381Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7573649Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.7574931Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7576595Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7577005Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.7577080Z Autotune Choices Stats: 2025-12-04T10:01:23.7578725Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7579255Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7579687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7580329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7581700Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7583080Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7584397Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7585702Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7587023Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7588410Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.7589752Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7591070Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7592473Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7593783Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7594071Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.7594220Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7594293Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7594367Z unimplemented [] 2025-12-04T10:01:23.7594476Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7594682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7596093Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7596159Z graph_break [] 2025-12-04T10:01:23.7596303Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7596374Z Autotune Choices Stats: 2025-12-04T10:01:23.7597964Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7598252Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7598492Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7598888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7600160Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7601536Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7602805Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7604078Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7605340Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7606622Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7606906Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.7606985Z Autotune Choices Stats: 
2025-12-04T10:01:23.7608661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7609184Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7609545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7610174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7611605Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.7612921Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7614244Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7615558Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7616881Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7618232Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7619542Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7620892Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7622253Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7623569Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7623856Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.7623997Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7624067Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7624136Z unimplemented [] 2025-12-04T10:01:23.7624241Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7624445Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7625849Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7625922Z graph_break [] 2025-12-04T10:01:23.7626064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7626131Z Autotune Choices Stats: 2025-12-04T10:01:23.7627801Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7628098Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7628336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7628691Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7630008Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7631342Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7632620Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7633894Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7635160Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.7636426Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7636717Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.7636789Z Autotune Choices Stats: 2025-12-04T10:01:23.7638474Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7638993Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7639451Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7640087Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7641411Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7642721Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7644047Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7645359Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7646682Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7648032Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7649373Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.7650746Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7652051Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7653372Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7653649Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.7653794Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7653864Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7653944Z 
unimplemented [] 2025-12-04T10:01:23.7654052Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7654258Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7655851Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7655918Z graph_break [] 2025-12-04T10:01:23.7656056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7656127Z Autotune Choices Stats: 2025-12-04T10:01:23.7657781Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7658076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7658313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7658784Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7660103Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7661373Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7662648Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7663922Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7665199Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7666469Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7666790Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.7666861Z Autotune Choices Stats: 2025-12-04T10:01:23.7668569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7669210Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7669577Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7670215Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7671545Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7672863Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.7674189Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7675506Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7676877Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7678192Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7679592Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7680904Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7682220Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7683535Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7683815Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.7683955Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7684024Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7684086Z unimplemented [] 2025-12-04T10:01:23.7684198Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7684399Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7685803Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7685903Z graph_break [] 2025-12-04T10:01:23.7686045Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7686114Z Autotune Choices Stats: 2025-12-04T10:01:23.7687697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7688080Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7688320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7688692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7689970Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7691237Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7692504Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7693775Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7695049Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7696359Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7696647Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.7696713Z Autotune Choices Stats: 2025-12-04T10:01:23.7698390Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7698971Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7699335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7699971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7701299Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7702610Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7703936Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7705277Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7706599Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7708041Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7709438Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7710756Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.7712082Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7713396Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7713673Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.7713816Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7713885Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7713948Z unimplemented [] 2025-12-04T10:01:23.7714057Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7714261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7715698Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7715763Z graph_break [] 2025-12-04T10:01:23.7715902Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7715969Z Autotune Choices Stats: 2025-12-04T10:01:23.7717585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7717938Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7718177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7718532Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7719822Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7721094Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7722363Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7723641Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7724960Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7726229Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7726578Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.7726683Z Autotune Choices Stats: 2025-12-04T10:01:23.7728327Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7728846Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7729216Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7729845Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7731161Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7732494Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7733815Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7735176Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7736527Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7737899Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7739217Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7740534Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7741835Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7743149Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7743432Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.7743574Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.7743646Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7743744Z unimplemented [] 2025-12-04T10:01:23.7743855Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7744057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7745450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7745604Z graph_break [] 2025-12-04T10:01:23.7745740Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7745845Z Autotune Choices Stats: 2025-12-04T10:01:23.7747521Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7747819Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7748059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7748421Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7749701Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7750971Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7752236Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7753541Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7754819Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7756396Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7756785Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.7756852Z Autotune Choices Stats: 2025-12-04T10:01:23.7758498Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.7759024Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7759387Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7760022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7761362Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7762684Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7764070Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7765385Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7766804Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7768115Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7769424Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7770740Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7772051Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7773393Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7773681Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.7773826Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7773897Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7773963Z unimplemented [] 2025-12-04T10:01:23.7774078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7774287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7775716Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7775843Z graph_break [] 2025-12-04T10:01:23.7775984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7776070Z Autotune Choices Stats: 2025-12-04T10:01:23.7777661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7777958Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7778195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7778555Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7779840Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7781111Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.7782379Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7783713Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7785008Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7786351Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7786640Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.7786709Z Autotune Choices Stats: 2025-12-04T10:01:23.7788409Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.7788932Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7789293Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7789934Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7791258Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7792618Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7793947Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7795383Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7796957Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7798315Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7799625Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7800949Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7802257Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7803611Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7803894Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.7804036Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7804177Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7804243Z unimplemented [] 2025-12-04T10:01:23.7804367Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7804607Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7806003Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7806066Z graph_break [] 2025-12-04T10:01:23.7806203Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7806277Z Autotune Choices Stats: 2025-12-04T10:01:23.7807864Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7808157Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7808399Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7808753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7810039Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7811309Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7812825Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7814110Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7815505Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7816793Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7817078Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.7817147Z Autotune Choices Stats: 2025-12-04T10:01:23.7818773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7819305Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7819668Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7820302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7821625Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7822978Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7824320Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7825682Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7826991Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7828356Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7829674Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7830986Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7832335Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7833648Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7834024Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.7834169Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7834238Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7834303Z unimplemented [] 2025-12-04T10:01:23.7834412Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7834615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7836016Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7836080Z graph_break [] 2025-12-04T10:01:23.7836218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7836293Z Autotune Choices Stats: 2025-12-04T10:01:23.7837869Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7838166Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7838404Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7838758Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7840032Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7841343Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7842610Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7843984Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7845247Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7846538Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7846819Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.7846887Z Autotune Choices Stats: 2025-12-04T10:01:23.7848511Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7849037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7849397Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7850029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7851393Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7852752Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7854147Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7855708Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7857034Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7858343Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7859672Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7861103Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7862579Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7864090Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7864381Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.7864526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7864596Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7864659Z unimplemented [] 2025-12-04T10:01:23.7864770Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7864973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7866383Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7866448Z graph_break [] 2025-12-04T10:01:23.7866584Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7866660Z Autotune Choices Stats: 2025-12-04T10:01:23.7868320Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.7868621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7868860Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7869220Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7870547Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7871829Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7873136Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7874478Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7875752Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7877037Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7877315Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.7877392Z Autotune Choices Stats: 2025-12-04T10:01:23.7879017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.7879541Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7879902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7880577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7881903Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7883321Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7884644Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7885972Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7887293Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7888612Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7889925Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7891276Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7892638Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7894009Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.7894288Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.7894429Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7894497Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7894561Z unimplemented [] 2025-12-04T10:01:23.7894682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7894890Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7896289Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7896350Z graph_break [] 2025-12-04T10:01:23.7896485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7896560Z Autotune Choices Stats: 2025-12-04T10:01:23.7898141Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.7898434Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7898669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7899031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7900342Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7901643Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7902983Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7904266Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7905537Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7906826Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7907104Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.7907179Z Autotune Choices Stats: 2025-12-04T10:01:23.7908839Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.7909402Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7909762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7910396Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7911752Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7913139Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7914445Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7915764Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7917085Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7918396Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7919762Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7921073Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7922483Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7923804Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7924084Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.7924225Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7924294Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7924358Z unimplemented [] 2025-12-04T10:01:23.7924468Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7924669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7926068Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7926141Z graph_break [] 2025-12-04T10:01:23.7926276Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7926350Z Autotune Choices Stats: 2025-12-04T10:01:23.7927936Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.7928230Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7928518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7928876Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7930151Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7931517Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7932784Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7934060Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7935349Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7936866Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7937200Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.7937301Z Autotune Choices Stats: 2025-12-04T10:01:23.7938987Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.7939507Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7939870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7940613Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7941941Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7943268Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7944587Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7946041Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7947608Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7948956Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7950271Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7951672Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7952987Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7954312Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7954589Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.7954734Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7954804Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7954867Z unimplemented [] 2025-12-04T10:01:23.7954989Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7955348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7956801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7956869Z graph_break [] 2025-12-04T10:01:23.7957004Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7957077Z Autotune Choices Stats: 2025-12-04T10:01:23.7958732Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.7959033Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7959272Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7959617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7961077Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7962479Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7963759Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7965028Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7966309Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7967589Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7967871Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.7967946Z Autotune Choices Stats: 2025-12-04T10:01:23.7969624Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.7970160Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7970666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7971307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7972625Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7973942Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7975248Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7976566Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7977914Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7979226Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.7980574Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.7981938Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7983258Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7984572Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.7984854Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.7984990Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.7985075Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.7985141Z unimplemented [] 2025-12-04T10:01:23.7985252Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.7985459Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.7986846Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.7986916Z graph_break [] 2025-12-04T10:01:23.7987049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.7987125Z Autotune Choices Stats: 2025-12-04T10:01:23.7988840Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.7989134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.7989471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.7989825Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.7991118Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7992394Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7993668Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.7994942Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7996224Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.7997540Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.7997823Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.7997897Z Autotune Choices Stats: 2025-12-04T10:01:23.7999552Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:23.8000162Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8000523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8001160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8002492Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8003811Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8005137Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8006468Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8007827Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8009175Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8010702Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8012029Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8013345Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8014672Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8014963Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:23.8015102Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8015181Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8015249Z unimplemented [] 2025-12-04T10:01:23.8015362Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8015573Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8016998Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8017072Z graph_break [] 2025-12-04T10:01:23.8017208Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8017284Z Autotune Choices Stats: 2025-12-04T10:01:23.8018920Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8019287Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8019521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8019869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8021153Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8022430Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8023712Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8024988Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8026287Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8027621Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8027902Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:23.8028049Z 
Autotune Choices Stats: 2025-12-04T10:01:23.8029718Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:23.8030257Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8030619Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8031264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8032592Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.8033915Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8035227Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8036585Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8037900Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8039333Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8040649Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8041966Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8043290Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8044611Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8044891Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:23.8045028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8045109Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8045173Z unimplemented [] 2025-12-04T10:01:23.8045278Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8045519Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8047043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8047147Z graph_break [] 2025-12-04T10:01:23.8047341Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8047555Z Autotune Choices Stats: 2025-12-04T10:01:23.8049834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8050256Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8050604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8051106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8052955Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8054765Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8056883Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8058735Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8060890Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.8062982Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8063410Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:23.8063490Z Autotune Choices Stats: 2025-12-04T10:01:23.8065152Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.8065688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8066071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8066719Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8068174Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8069543Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8070942Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8072279Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8073723Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8075050Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8076383Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.8077697Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8079030Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8080353Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8080641Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:23.8080844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8080928Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8080995Z 
unimplemented [] 2025-12-04T10:01:23.8081112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8081326Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8082722Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8082891Z graph_break [] 2025-12-04T10:01:23.8083033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8083107Z Autotune Choices Stats: 2025-12-04T10:01:23.8084711Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.8085011Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8085259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8085625Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8086934Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8088218Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8089502Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8090832Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8092125Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8093508Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8093792Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:23.8093867Z Autotune Choices Stats: 2025-12-04T10:01:23.8095520Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.8096054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8096417Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8097061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8098412Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8099758Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.8101118Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8102483Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8103872Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8105204Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8106533Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8107906Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8109247Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8110611Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8110903Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:23.8111040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8111118Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8111182Z unimplemented [] 2025-12-04T10:01:23.8111291Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8111566Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8113018Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8113087Z graph_break [] 2025-12-04T10:01:23.8113222Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8113290Z Autotune Choices Stats: 2025-12-04T10:01:23.8114901Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8115196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8115439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8115801Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8117098Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8118379Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8119705Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8121012Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8122391Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8123675Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8123961Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:23.8124035Z Autotune Choices Stats: 2025-12-04T10:01:23.8125681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.8126213Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8126583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8127237Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8128588Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8129965Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8131289Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8132730Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8134056Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8135383Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8136887Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8138231Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.8139621Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8140954Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8141246Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:23.8141509Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.8141601Z Traceback (most recent call last): 2025-12-04T10:01:23.8141989Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.8142059Z self.assertTrue( 2025-12-04T10:01:23.8142295Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.8142378Z raise self.failureException(msg) 2025-12-04T10:01:23.8142666Z AssertionError: False is not true : Log file /tmp/tmp0t2ys9b1/flex_attention_configs.json was not created 2025-12-04T10:01:23.8142672Z 2025-12-04T10:01:23.8142824Z To execute this test, run the following 
from the base repo dir: 2025-12-04T10:01:23.8143122Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.8143130Z 2025-12-04T10:01:23.8143320Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.8143459Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8143538Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8143602Z unimplemented [] 2025-12-04T10:01:23.8143715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8145118Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.8145326Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8145399Z graph_break [] 2025-12-04T10:01:23.8145533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8146707Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.8146801Z current_size = base.storage().size() 2025-12-04T10:01:23.8146872Z Autotune Choices Stats: 2025-12-04T10:01:23.8148587Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.8148883Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8149131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8149495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8150830Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8152187Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8153482Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8154763Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8156201Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8157487Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8157773Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.8157846Z Autotune Choices Stats: 2025-12-04T10:01:23.8159571Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.8160098Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8160463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8161254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8162606Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8163942Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8165264Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8166588Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8167903Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8169264Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8170586Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8172051Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8173387Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8174701Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8174986Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.8175126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8175202Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8175270Z unimplemented [] 2025-12-04T10:01:23.8175382Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8175590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8176992Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8177053Z graph_break [] 2025-12-04T10:01:23.8177197Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8177266Z Autotune Choices Stats: 2025-12-04T10:01:23.8178907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8179196Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8179452Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8179878Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8181203Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8182482Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8183767Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8185040Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8186522Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8187956Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8188295Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.8188366Z Autotune Choices Stats: 2025-12-04T10:01:23.8190023Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8190658Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8191031Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8191675Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8193012Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8194340Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8195662Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8196985Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8198348Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8199675Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8201087Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8202408Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8203730Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8205060Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8205352Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.8205489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8205559Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8205629Z unimplemented [] 2025-12-04T10:01:23.8205733Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8205945Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8207344Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8207411Z graph_break [] 2025-12-04T10:01:23.8207586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8207656Z Autotune Choices Stats: 2025-12-04T10:01:23.8209251Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8209609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8209895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8210253Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8211538Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8212825Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8214108Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8215386Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8216659Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8217981Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8218268Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:23.8218336Z Autotune Choices Stats: 2025-12-04T10:01:23.8220016Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8220596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8220963Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8221608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8222955Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8224285Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8225624Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8226944Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8228388Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8229753Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8231131Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8232460Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8233782Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8235107Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8235397Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.8235534Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8235606Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8235677Z unimplemented [] 2025-12-04T10:01:23.8235779Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8235995Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8237600Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8237675Z graph_break [] 2025-12-04T10:01:23.8237825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8237893Z Autotune Choices Stats: 2025-12-04T10:01:23.8239537Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8239889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8240137Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8240497Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8241796Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8243083Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8244369Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8245659Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8246970Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.8248243Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8248527Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.8248655Z Autotune Choices Stats: 2025-12-04T10:01:23.8250326Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8250851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8251221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8251861Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8253200Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8254538Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8256010Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8257398Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8258720Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8260205Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8261534Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:23.8262862Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8264187Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8265512Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8265799Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.8265938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8266010Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8266081Z unimplemented [] 
2025-12-04T10:01:23.8266188Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8266429Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8267881Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8267944Z graph_break [] 2025-12-04T10:01:23.8268150Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8268220Z Autotune Choices Stats: 2025-12-04T10:01:23.8269854Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8270142Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8270384Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8270751Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8272043Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8273330Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8274613Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8275897Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8277203Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8278674Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8279036Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.8279108Z Autotune Choices Stats: 2025-12-04T10:01:23.8280766Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8281300Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8281673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8282317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8283673Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8284998Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8286368Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8287688Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8289132Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8290461Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8291783Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8293107Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8294427Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8295754Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8296073Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.8296219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8296308Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8296384Z unimplemented [] 2025-12-04T10:01:23.8296496Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8296702Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8298129Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8298286Z graph_break [] 2025-12-04T10:01:23.8298431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8298504Z Autotune Choices Stats: 2025-12-04T10:01:23.8300100Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8300390Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8300634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8301002Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8302291Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8303586Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8304873Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8306194Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8307554Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8308928Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8309219Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.8309287Z Autotune Choices Stats: 2025-12-04T10:01:23.8310943Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8311465Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8311830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8312477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8313810Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8315184Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8316518Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8317878Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8319269Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8320603Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8321934Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8323254Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8324574Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8326415Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8326706Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.8326849Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8326919Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8326987Z unimplemented [] 2025-12-04T10:01:23.8327207Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8327413Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8328841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8328905Z graph_break [] 2025-12-04T10:01:23.8329047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8329115Z Autotune Choices Stats: 2025-12-04T10:01:23.8330715Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8331003Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8331245Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8331607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8332903Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.8334185Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8335508Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8336798Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8338199Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8339477Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8339763Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.8339836Z Autotune Choices Stats: 2025-12-04T10:01:23.8341479Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8342006Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8342380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:23.8343027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8344370Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8345742Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8347114Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8348615Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8349942Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8351266Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8352586Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8353914Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8355474Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8356833Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8357210Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.8357400Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:23.8357472Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8357543Z unimplemented [] 2025-12-04T10:01:23.8357649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8357852Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8359253Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8359319Z graph_break [] 2025-12-04T10:01:23.8359460Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8359528Z Autotune Choices Stats: 2025-12-04T10:01:23.8361125Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.8361419Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8361667Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8362033Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8363329Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8364664Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8365946Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8367259Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8368604Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8369883Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8370173Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.8370251Z Autotune Choices Stats: 2025-12-04T10:01:23.8371906Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8372433Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8372799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8373445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8374850Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8376181Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8377608Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8378942Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8380425Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8381758Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8383077Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8384406Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8385788Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8387151Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8387549Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.8387693Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8387764Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8387835Z unimplemented [] 2025-12-04T10:01:23.8387942Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8388145Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8389535Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8389601Z graph_break [] 2025-12-04T10:01:23.8389740Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8389808Z Autotune Choices Stats: 2025-12-04T10:01:23.8391416Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.8391711Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8391955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8392315Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8393617Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8394937Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8396257Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8397599Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8398892Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8400173Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8400455Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.8400521Z Autotune Choices Stats: 2025-12-04T10:01:23.8402173Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8402692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8403059Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8403753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8405098Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8406459Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8407875Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8409198Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8410517Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8411848Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8413171Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8414541Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8415878Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8417302Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8417582Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.8417722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8417795Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8417859Z unimplemented [] 2025-12-04T10:01:23.8417972Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8418176Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8419566Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8419628Z graph_break [] 2025-12-04T10:01:23.8419769Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8419841Z Autotune Choices Stats: 2025-12-04T10:01:23.8421445Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8421734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8421976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8422350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8423686Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8424966Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8426345Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8427680Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8428973Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8430260Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8430550Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.8430616Z Autotune Choices Stats: 2025-12-04T10:01:23.8432266Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8432790Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8433202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8433850Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8435230Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8436617Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8437944Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8439265Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8440603Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8441932Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8443299Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8444623Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8446086Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8447416Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8447699Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:23.8447840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8447910Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8447972Z unimplemented [] 2025-12-04T10:01:23.8448085Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8448288Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8449696Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8449762Z graph_break [] 2025-12-04T10:01:23.8449905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8449973Z Autotune Choices Stats: 2025-12-04T10:01:23.8451585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.8451882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8452165Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8452529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8453824Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8455387Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8456712Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8458011Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8459299Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8460588Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8460877Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.8460947Z Autotune Choices Stats: 2025-12-04T10:01:23.8462645Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8463181Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8463540Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8464188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8465670Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8467006Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8468378Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8469699Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8471035Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8472402Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8473734Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8475095Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8476475Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:23.8477808Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8478089Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.8478229Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8478298Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8478361Z unimplemented [] 2025-12-04T10:01:23.8478472Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8478677Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8480082Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8480145Z graph_break [] 2025-12-04T10:01:23.8480277Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8480349Z Autotune Choices Stats: 2025-12-04T10:01:23.8482020Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8482310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8482551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8482914Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8484265Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8485612Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8486895Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8488184Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8489461Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8490749Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8491031Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.8491111Z Autotune Choices Stats: 2025-12-04T10:01:23.8492793Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8493314Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8493771Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8494423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8495763Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8497107Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8498435Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8499763Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.8501088Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8502438Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8503794Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8505172Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8506497Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8507870Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8508151Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.8508295Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8508365Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8508429Z unimplemented [] 2025-12-04T10:01:23.8508540Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8508744Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8510145Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8510207Z graph_break [] 2025-12-04T10:01:23.8510343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8510418Z Autotune Choices Stats: 2025-12-04T10:01:23.8512059Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.8512350Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8512591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8513049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8514346Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8515642Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8516928Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8518220Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8519506Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8520860Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8521151Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.8521221Z Autotune Choices Stats: 2025-12-04T10:01:23.8522857Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8523483Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:23.8523847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8524498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8525846Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8527177Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8528519Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8529841Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8531207Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8532528Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8533949Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8535280Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8536614Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8537950Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:23.8538236Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.8538379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8538448Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8538511Z unimplemented [] 2025-12-04T10:01:23.8538620Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8538824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8540261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8540327Z graph_break [] 2025-12-04T10:01:23.8540459Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8540531Z Autotune Choices Stats: 2025-12-04T10:01:23.8542130Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8542531Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8542774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8543134Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8544422Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8545715Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8546996Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8548339Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8549616Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8550950Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8551235Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.8551302Z Autotune Choices Stats: 2025-12-04T10:01:23.8552985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8553585Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8553946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8554602Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8556078Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8557419Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8558749Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8560125Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8561457Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.8562916Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8564259Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8565595Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8566912Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8568241Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8568521Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.8568662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8568733Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8568797Z unimplemented [] 2025-12-04T10:01:23.8568910Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8569116Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8570555Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8570621Z graph_break [] 2025-12-04T10:01:23.8570753Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.8570826Z Autotune Choices Stats: 2025-12-04T10:01:23.8572459Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8572812Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8573055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8573419Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8574724Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8576006Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8577290Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8578571Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8579892Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8587726Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8588232Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.8588315Z Autotune Choices Stats: 2025-12-04T10:01:23.8589982Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8590528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8590904Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8591548Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:23.8592891Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8594222Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8595533Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8596897Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8598252Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8599624Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8600938Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8602252Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8603567Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8604885Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8605174Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.8605326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8605438Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8605506Z unimplemented [] 2025-12-04T10:01:23.8605620Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8605830Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8607231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8607415Z graph_break [] 2025-12-04T10:01:23.8607587Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8607663Z Autotune Choices Stats: 2025-12-04T10:01:23.8609265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8609562Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8609805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8610166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8611455Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8612723Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8613980Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8615284Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:23.8616560Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8617952Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8618238Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.8618315Z Autotune Choices Stats: 2025-12-04T10:01:23.8619944Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8620469Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8620829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8621473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8622822Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8624140Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8625493Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8626805Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8628298Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8629605Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.8630926Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8632238Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8633551Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8634904Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8635192Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.8635333Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8635411Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8635477Z unimplemented [] 2025-12-04T10:01:23.8635596Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8635815Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8637316Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8637386Z graph_break [] 2025-12-04T10:01:23.8637519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8637590Z Autotune Choices Stats: 2025-12-04T10:01:23.8639199Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8639498Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8639740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8640093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8641401Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8642683Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8644000Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8645282Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8646591Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8647939Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8648222Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.8648299Z Autotune Choices Stats: 
2025-12-04T10:01:23.8649935Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8650463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8650824Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8651473Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8653028Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.8654421Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8655906Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8657481Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8659303Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8660641Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8661961Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8663279Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8664608Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8666027Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8666321Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.8666469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8666618Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8666687Z unimplemented [] 2025-12-04T10:01:23.8666804Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8667050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8668523Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8668597Z graph_break [] 2025-12-04T10:01:23.8668733Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8668814Z Autotune Choices Stats: 2025-12-04T10:01:23.8670408Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8670701Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8670944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8671301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8672605Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8673880Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8675192Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8676490Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8677864Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.8679141Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8679427Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.8679501Z Autotune Choices Stats: 2025-12-04T10:01:23.8681140Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8681671Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8682033Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8682677Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8684038Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8685369Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8686728Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8688112Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8689444Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8690754Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8692076Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.8693383Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8694742Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8696065Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8696457Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.8696595Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8696670Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8696731Z 
unimplemented [] 2025-12-04T10:01:23.8696839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8697041Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8698435Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8698506Z graph_break [] 2025-12-04T10:01:23.8698650Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8698723Z Autotune Choices Stats: 2025-12-04T10:01:23.8700319Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8700612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8700852Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8701204Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8702494Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8703823Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8705101Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8706471Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8707801Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8709090Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8709369Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.8709447Z Autotune Choices Stats: 2025-12-04T10:01:23.8711076Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8711597Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8711957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8712600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8713967Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8715330Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.8716722Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8718050Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8719371Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8720686Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8722008Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8723352Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8724672Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8726090Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8726369Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.8726503Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8726579Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8726640Z unimplemented [] 2025-12-04T10:01:23.8726743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8726953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8728344Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8728411Z graph_break [] 2025-12-04T10:01:23.8728544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8728619Z Autotune Choices Stats: 2025-12-04T10:01:23.8730216Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8730511Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8730748Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8731099Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8732447Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8733724Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8735047Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8736388Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8737683Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8738957Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8739237Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.8739309Z Autotune Choices Stats: 2025-12-04T10:01:23.8740943Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8741473Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8741870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8742508Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8743838Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8745301Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8746617Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8747987Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8749309Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8750630Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8751996Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8753347Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.8754732Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8756245Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8756533Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.8756688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8756769Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8756831Z unimplemented [] 2025-12-04T10:01:23.8756937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8757147Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8758532Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8758601Z graph_break [] 2025-12-04T10:01:23.8758735Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8758806Z Autotune Choices Stats: 2025-12-04T10:01:23.8760391Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8760681Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8760921Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8761272Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8762644Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8763963Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8765349Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8766634Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8767914Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8769197Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8769478Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.8769553Z Autotune Choices Stats: 2025-12-04T10:01:23.8771180Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8771744Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8772108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8772745Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8774174Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8775505Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8776830Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8778151Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8779469Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8780782Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8782140Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8783494Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8784871Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8786180Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8786471Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.8786605Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.8786679Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8786742Z unimplemented [] 2025-12-04T10:01:23.8786846Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8787054Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8788524Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8788596Z graph_break [] 2025-12-04T10:01:23.8788732Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8788800Z Autotune Choices Stats: 2025-12-04T10:01:23.8790404Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8790739Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8790983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8791342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8792631Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8793996Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8795271Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8796566Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8797837Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8799113Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8799389Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.8799462Z Autotune Choices Stats: 2025-12-04T10:01:23.8801130Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.8801656Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8802012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8802773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8804125Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8805459Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8806789Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8808114Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8809435Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8810798Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8812124Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8813540Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8814865Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8816200Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8816491Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.8816630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8816716Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8816786Z unimplemented [] 2025-12-04T10:01:23.8816893Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8817105Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8818499Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8818573Z graph_break [] 2025-12-04T10:01:23.8818706Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8818773Z Autotune Choices Stats: 2025-12-04T10:01:23.8820428Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8820722Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8820965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8821316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8822713Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8823981Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.8825265Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8826548Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8827883Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8829151Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8829432Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.8829505Z Autotune Choices Stats: 2025-12-04T10:01:23.8831186Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.8831776Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8832169Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8832809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8834152Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8835482Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8836815Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8838135Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8839726Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8841059Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8842439Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8843832Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8845168Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8846501Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8846798Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.8846938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8847016Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8847080Z unimplemented [] 2025-12-04T10:01:23.8847187Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8847394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8848783Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8848852Z graph_break [] 2025-12-04T10:01:23.8848988Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8849094Z Autotune Choices Stats: 2025-12-04T10:01:23.8850699Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8851050Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8851333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8851689Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8852976Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8854241Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8855716Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8857011Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8858283Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8859639Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8859921Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.8859995Z Autotune Choices Stats: 2025-12-04T10:01:23.8861690Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8862317Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8862686Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8863331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8864665Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8866001Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8867424Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8868882Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8870260Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8871615Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8872994Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8874309Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8875652Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8876981Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8877267Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.8877407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8877484Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8877549Z unimplemented [] 2025-12-04T10:01:23.8877655Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8877867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8879313Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8879383Z graph_break [] 2025-12-04T10:01:23.8879517Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8879586Z Autotune Choices Stats: 2025-12-04T10:01:23.8881232Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8881588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8881837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8882189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8883487Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8884759Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8886038Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8887315Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8888616Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8889900Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8890177Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.8890324Z Autotune Choices Stats: 2025-12-04T10:01:23.8892011Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.8892538Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8892902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8893540Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8895000Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8896333Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8897655Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8899054Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8900377Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8901786Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8903109Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8904454Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8905775Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8907100Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8907446Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.8907580Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8907658Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8907721Z unimplemented [] 2025-12-04T10:01:23.8907825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8908075Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8909459Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8909537Z graph_break [] 2025-12-04T10:01:23.8909673Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8909804Z Autotune Choices Stats: 2025-12-04T10:01:23.8911462Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.8911743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8911989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8912343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8913634Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8914897Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8916180Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8917462Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8918773Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8920086Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8920428Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.8920508Z Autotune Choices Stats: 2025-12-04T10:01:23.8922159Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.8922688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8923047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8923684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8925025Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8926360Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8927715Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8929036Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8930449Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8931768Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8933096Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8934409Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8935743Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8937060Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.8937385Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:23.8937523Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8937599Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8937661Z unimplemented [] 2025-12-04T10:01:23.8937765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8937971Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8939389Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8939524Z graph_break [] 2025-12-04T10:01:23.8939664Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8939733Z Autotune Choices Stats: 2025-12-04T10:01:23.8941321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.8941612Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8941859Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8942212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8943498Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8944770Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8946052Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8947437Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8948728Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8950130Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8950413Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:23.8950480Z Autotune Choices Stats: 2025-12-04T10:01:23.8952131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.8952663Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8953026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8953671Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8955014Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8956510Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8957918Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8959294Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8960706Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8962034Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8963374Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8964685Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8966013Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8967362Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8967652Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:23.8967787Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8967861Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8967934Z unimplemented [] 2025-12-04T10:01:23.8968041Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8968441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8969920Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.8969989Z graph_break [] 2025-12-04T10:01:23.8970123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.8970190Z Autotune Choices Stats: 2025-12-04T10:01:23.8971793Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.8972093Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8972339Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8972691Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8973984Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8975245Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8976568Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.8977839Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8979208Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8980481Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.8980759Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:23.8980827Z Autotune Choices Stats: 2025-12-04T10:01:23.8982476Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.8982992Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.8983359Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.8983993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.8985327Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8986720Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8988149Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8989587Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8990914Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8992244Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8993565Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.8994892Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.8996261Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.8997584Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.8997938Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:23.8998075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.8998182Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.8998255Z unimplemented [] 2025-12-04T10:01:23.8998363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.8998583Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.8999986Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9000054Z graph_break [] 2025-12-04T10:01:23.9000188Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9000257Z Autotune Choices Stats: 2025-12-04T10:01:23.9001872Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.9002159Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9002406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9002760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9004060Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9005363Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9006652Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9007966Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9009306Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9010586Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9010866Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:23.9010932Z Autotune Choices Stats: 2025-12-04T10:01:23.9012576Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:23.9013099Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9013463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9014101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9015481Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9016807Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9018228Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9019557Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9020888Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9022210Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9023527Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9024845Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9026244Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9027655Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9028005Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:23.9028138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9028205Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9028277Z unimplemented [] 2025-12-04T10:01:23.9028395Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9028610Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9030005Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9030076Z graph_break [] 2025-12-04T10:01:23.9030208Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9030275Z Autotune Choices Stats: 2025-12-04T10:01:23.9031882Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9032169Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9032408Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9032760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9034060Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9035378Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9036693Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9038034Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9039316Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9040590Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9040872Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:23.9040938Z Autotune Choices Stats: 2025-12-04T10:01:23.9042584Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:23.9043100Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9043464Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9044138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9045472Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9046821Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9048256Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9049583Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9050905Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9052235Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9053557Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9054927Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9056459Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9057971Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9058256Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:23.9058389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9058461Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9058532Z unimplemented [] 2025-12-04T10:01:23.9058636Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9058844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9060243Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9060306Z graph_break [] 2025-12-04T10:01:23.9060445Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9060511Z Autotune Choices Stats: 2025-12-04T10:01:23.9062113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9062398Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9062642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9062993Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9064343Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9065618Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9066998Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9068316Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9069614Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9070886Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9071177Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:23.9071247Z 
Autotune Choices Stats: 2025-12-04T10:01:23.9072887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:23.9073407Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9073816Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9074451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9075813Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.9077195Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9078517Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9079842Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9081159Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9082487Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9083843Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9085166Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9086591Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9087904Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9088191Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:23.9088327Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9088396Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9088467Z unimplemented [] 2025-12-04T10:01:23.9088572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9088773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9090162Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9090227Z graph_break [] 2025-12-04T10:01:23.9090368Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9090435Z Autotune Choices Stats: 2025-12-04T10:01:23.9092037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9092324Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9092586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9092974Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9094262Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9095659Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9096940Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9098209Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9099496Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:23.9100782Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9101063Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:23.9101129Z Autotune Choices Stats: 2025-12-04T10:01:23.9102810Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.9103333Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9103694Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9104325Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9105759Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9107081Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9108454Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9109786Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9111116Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9112474Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9113795Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:23.9115149Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9116538Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9117853Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9118146Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:23.9118279Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9118346Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9118415Z 
unimplemented [] 2025-12-04T10:01:23.9118519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9118723Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9120125Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9120186Z graph_break [] 2025-12-04T10:01:23.9120321Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9120387Z Autotune Choices Stats: 2025-12-04T10:01:23.9122022Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:23.9122309Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9122547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9122893Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9124231Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9125570Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9126850Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9128128Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9129406Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9130682Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9130974Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:23.9131048Z Autotune Choices Stats: 2025-12-04T10:01:23.9132724Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.9133243Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9133723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9134361Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9135700Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9137021Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.9138341Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9139665Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9140987Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9142346Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9143691Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9145075Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9146401Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9147773Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9148058Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:23.9148199Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9148270Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9148340Z unimplemented [] 2025-12-04T10:01:23.9148447Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9148663Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9150060Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9150121Z graph_break [] 2025-12-04T10:01:23.9150265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9150333Z Autotune Choices Stats: 2025-12-04T10:01:23.9151990Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9152275Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9152516Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9152965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9154254Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9155886Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9157180Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9158449Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9159731Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9161011Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9161384Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:23.9161456Z Autotune Choices Stats: 2025-12-04T10:01:23.9163106Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:23.9163764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9164134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9164772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9166112Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9167439Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9168771Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9170100Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9171479Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9172796Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9174211Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9175532Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.9176845Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9178170Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9178455Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:23.9178594Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9178662Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9178730Z unimplemented [] 2025-12-04T10:01:23.9178836Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9179039Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9180436Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9180539Z graph_break [] 2025-12-04T10:01:23.9180683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9180750Z Autotune Choices Stats: 2025-12-04T10:01:23.9182347Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9182732Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9182978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9183329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9184619Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9185912Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9187188Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9188550Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9189828Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9191140Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9191423Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:23.9191490Z Autotune Choices Stats: 2025-12-04T10:01:23.9193173Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.9193752Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9194116Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9194756Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9196105Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9197426Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9198755Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9200112Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9201440Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9202933Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9204246Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9205563Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9206898Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9208228Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9208516Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:23.9208708Z ___________ 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:23.9208790Z Traceback (most recent call last): 2025-12-04T10:01:23.9209145Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:23.9209212Z self.assertTrue( 2025-12-04T10:01:23.9209442Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:23.9209533Z raise self.failureException(msg) 2025-12-04T10:01:23.9209847Z AssertionError: False is not true : Log file /tmp/tmpv0q9k1ov/flex_attention_configs.json was not created 2025-12-04T10:01:23.9209853Z 2025-12-04T10:01:23.9210002Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:23.9210296Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:23.9210300Z 2025-12-04T10:01:23.9210477Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:23.9210633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9210703Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9210835Z unimplemented [] 2025-12-04T10:01:23.9210950Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9212381Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:23.9212594Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:23.9212656Z graph_break [] 2025-12-04T10:01:23.9212797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9213962Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:23.9214047Z current_size = base.storage().size() 2025-12-04T10:01:23.9214123Z Autotune Choices Stats: 2025-12-04T10:01:23.9215720Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:23.9216026Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9216267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9216624Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9217907Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9219220Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9220498Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9221869Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9223138Z triton_flex_attention_3 0.0154 ms 61.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9224416Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9224704Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:23.9224773Z Autotune Choices Stats: 2025-12-04T10:01:23.9226412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:23.9226939Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9227403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9228047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9229417Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9230781Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9232190Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9233509Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9234833Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9236143Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9237477Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9238840Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9240157Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9241586Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9241867Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:23.9242011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9242082Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9242145Z unimplemented [] 2025-12-04T10:01:23.9242255Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9242457Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9243862Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9243923Z graph_break [] 2025-12-04T10:01:23.9244057Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9244127Z Autotune Choices Stats: 2025-12-04T10:01:23.9245739Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:23.9246034Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9246271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9246628Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9247937Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9249214Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9250514Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9251857Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9253123Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9254395Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9254679Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:23.9254747Z Autotune Choices Stats: 2025-12-04T10:01:23.9256744Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9257278Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9257646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9258363Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9259700Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9261171Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9262495Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9263811Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9265122Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.9266426Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9267817Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9269181Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9270531Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9271910Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9272191Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:23.9272338Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9272408Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9272471Z unimplemented [] 2025-12-04T10:01:23.9272584Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9272789Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9274195Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9274258Z graph_break [] 2025-12-04T10:01:23.9274396Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.9274470Z Autotune Choices Stats: 2025-12-04T10:01:23.9276076Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9276365Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9276604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9276965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9278283Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9279638Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9280968Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9282246Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9283513Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9284792Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9285072Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:23.9285145Z Autotune Choices Stats: 2025-12-04T10:01:23.9286794Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9287352Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9287709Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9288354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9289723Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9291121Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9292453Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9293766Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9295090Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9296415Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9297769Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9299080Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9300498Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9301818Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9302102Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:23.9302240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9302311Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9302375Z unimplemented [] 2025-12-04T10:01:23.9302493Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9302695Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:23.9304084Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9304157Z graph_break [] 2025-12-04T10:01:23.9304290Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9304364Z Autotune Choices Stats: 2025-12-04T10:01:23.9305953Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9306258Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9306534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9306893Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9308241Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9309624Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9310896Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9312177Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9313447Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9314749Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9315095Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:23.9315217Z Autotune Choices Stats: 2025-12-04T10:01:23.9317064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9317598Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9317960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9318700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9320029Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9321357Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9322675Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9324006Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9325343Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9326714Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9328041Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9329455Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9330775Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9332106Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9332384Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:23.9332524Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9332592Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9332655Z unimplemented [] 2025-12-04T10:01:23.9332763Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9332969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9334357Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9334424Z graph_break [] 2025-12-04T10:01:23.9334556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9334627Z Autotune Choices Stats: 2025-12-04T10:01:23.9336256Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9336553Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9336801Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9337158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9338487Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9339812Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9341093Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9342364Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9343633Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9344916Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9345193Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:23.9345267Z Autotune Choices Stats: 2025-12-04T10:01:23.9347308Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9347848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9348329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9348971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9350299Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9351631Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9352955Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9354281Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9355788Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9357181Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9358602Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9360007Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9361336Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9362659Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9362935Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:23.9363076Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9363153Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9363216Z unimplemented [] 2025-12-04T10:01:23.9363325Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9363530Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9364907Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9364979Z graph_break [] 2025-12-04T10:01:23.9365112Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:23.9365189Z Autotune Choices Stats: 2025-12-04T10:01:23.9366822Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9367113Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9367415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9367798Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9369086Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9370364Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9371802Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9373087Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9374362Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9375690Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9375981Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:23.9376055Z Autotune Choices Stats: 2025-12-04T10:01:23.9377729Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9378319Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9378678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9379318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9380654Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9381990Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9383320Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9384641Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9386008Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9387400Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9388852Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9390175Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9391506Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9392828Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9393112Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:23.9393246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9393319Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9393382Z unimplemented [] 2025-12-04T10:01:23.9393490Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9393694Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:23.9395116Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9395186Z graph_break [] 2025-12-04T10:01:23.9395317Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9395388Z Autotune Choices Stats: 2025-12-04T10:01:23.9396993Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9397396Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9397631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9397980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9399262Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9400538Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9401810Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9403094Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9404382Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9405698Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9405987Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:23.9406058Z Autotune Choices Stats: 2025-12-04T10:01:23.9407724Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9408312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9408673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9409318Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9410651Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9411981Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9413321Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9414690Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9416020Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9417427Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9418747Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9420082Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9421400Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9422729Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9423009Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:23.9423143Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9423218Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9423283Z unimplemented [] 2025-12-04T10:01:23.9423396Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9423609Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9425038Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9425108Z graph_break [] 2025-12-04T10:01:23.9425240Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9425313Z Autotune Choices Stats: 2025-12-04T10:01:23.9427032Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.9427395Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9427636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9427985Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9429280Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9430553Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9431836Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9433120Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9434428Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9435706Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9436099Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:23.9436173Z Autotune Choices Stats: 2025-12-04T10:01:23.9437819Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9438346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9438711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9439350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9440684Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9442015Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9443368Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9444698Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9446064Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9447441Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9448765Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9450080Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9451407Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9452731Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9453011Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:23.9453182Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9453259Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9453321Z unimplemented [] 2025-12-04T10:01:23.9453423Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9453627Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9455013Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9455144Z graph_break [] 2025-12-04T10:01:23.9455511Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9455595Z Autotune Choices Stats: 2025-12-04T10:01:23.9457191Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:23.9457483Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9457724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9458076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9459367Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9460641Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9461914Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9463268Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9464538Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9465940Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9466218Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.9466292Z Autotune Choices Stats: 2025-12-04T10:01:23.9467994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9468520Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9468881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9469520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9470862Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9472453Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9473840Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9475213Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9476599Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9477922Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9479247Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9480567Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9481898Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9483245Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9483532Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:23.9483667Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9483741Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9483804Z unimplemented [] 2025-12-04T10:01:23.9483908Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9484115Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9485613Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9485681Z graph_break [] 2025-12-04T10:01:23.9485815Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9485889Z Autotune Choices Stats: 2025-12-04T10:01:23.9487488Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9487777Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9488013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9488362Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9489651Z 
triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9490924Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9492244Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9493521Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9494830Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9496193Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9496471Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:23.9496546Z Autotune Choices Stats: 2025-12-04T10:01:23.9498186Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:23.9498707Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9499074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9499718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9501065Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9502434Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9503756Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9505185Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9506513Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9507885Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9509208Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9510539Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9511866Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9513277Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9513566Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:23.9513767Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9513844Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9513908Z unimplemented [] 2025-12-04T10:01:23.9514044Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9514256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9515635Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9515706Z graph_break [] 2025-12-04T10:01:23.9515841Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9515913Z Autotune Choices Stats: 2025-12-04T10:01:23.9517517Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:23.9517804Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9518050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9518405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9519697Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9520972Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9522307Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9523619Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9524949Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9526226Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9526507Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:23.9526582Z Autotune Choices Stats: 
2025-12-04T10:01:23.9528215Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9528740Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9529100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9529740Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9531103Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:23.9532439Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9533796Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9535200Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9536525Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9537847Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9539168Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9540483Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9541842Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9543156Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9543535Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:23.9543671Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9543744Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9543806Z unimplemented [] 2025-12-04T10:01:23.9543909Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9544119Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9545507Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9545576Z graph_break [] 2025-12-04T10:01:23.9545706Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9545781Z Autotune Choices Stats: 2025-12-04T10:01:23.9547440Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9547728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9547971Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9548316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9549599Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9550916Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9552195Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9553564Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9554835Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.9556277Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9556558Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:23.9556630Z Autotune Choices Stats: 2025-12-04T10:01:23.9558275Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9558797Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9559165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9559814Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9561210Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9562604Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9564014Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9565343Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9566660Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9567982Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9569317Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.9570669Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9571994Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9573421Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9573711Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:23.9573855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9573930Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9573994Z 
unimplemented [] 2025-12-04T10:01:23.9574097Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9574313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9575699Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9575767Z graph_break [] 2025-12-04T10:01:23.9575898Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9575963Z Autotune Choices Stats: 2025-12-04T10:01:23.9577573Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:23.9577858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9578104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9578455Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9579778Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9581047Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9582423Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9583707Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9584980Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9586262Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9586538Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:23.9586613Z Autotune Choices Stats: 2025-12-04T10:01:23.9588273Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9588810Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9589230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9589871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9591222Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9592656Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:23.9593975Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9595306Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9596626Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9597951Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9599315Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9600646Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9602000Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9603381Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9603671Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:23.9603810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9603888Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9603954Z unimplemented [] 2025-12-04T10:01:23.9604065Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9604286Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9605686Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9605758Z graph_break [] 2025-12-04T10:01:23.9605897Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9605966Z Autotune Choices Stats: 2025-12-04T10:01:23.9607565Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9607853Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9608112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9608521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9609820Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9611125Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9612469Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9613745Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9615025Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9616307Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9616597Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:23.9616668Z Autotune Choices Stats: 2025-12-04T10:01:23.9618317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9618878Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9619242Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9619883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9621323Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9622647Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9623971Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9625289Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9626607Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9627979Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9629356Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9630710Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:23.9632097Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9633424Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9633714Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:23.9633851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9633928Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9633991Z unimplemented [] 2025-12-04T10:01:23.9634098Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9634309Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9635701Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9635770Z graph_break [] 2025-12-04T10:01:23.9635910Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9635978Z Autotune Choices Stats: 2025-12-04T10:01:23.9637600Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9637929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9638180Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9638532Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9639850Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9641202Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9642484Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9643758Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9645035Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9646326Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9646610Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:23.9646679Z Autotune Choices Stats: 2025-12-04T10:01:23.9648370Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9648896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9649261Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9649989Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9651330Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9652665Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9653993Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9655583Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9656939Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9658345Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9659671Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9661179Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9662518Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9663840Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9664130Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:23.9664269Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:23.9664352Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9664416Z unimplemented [] 2025-12-04T10:01:23.9664525Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9664738Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9666142Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9666211Z graph_break [] 2025-12-04T10:01:23.9666344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9666413Z Autotune Choices Stats: 2025-12-04T10:01:23.9668120Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9668414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9668660Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9669018Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9670408Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9671678Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9672957Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9674226Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9675510Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9676787Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9677101Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:23.9677171Z Autotune Choices Stats: 2025-12-04T10:01:23.9678824Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9679461Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9679831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9680462Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9681801Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9683129Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9684448Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9685782Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9687130Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9688611Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9690058Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9691379Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9692705Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9694029Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9694328Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:23.9694471Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9694550Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9694613Z unimplemented [] 2025-12-04T10:01:23.9694720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9694928Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9696320Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9696389Z graph_break [] 2025-12-04T10:01:23.9696559Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9696629Z Autotune Choices Stats: 2025-12-04T10:01:23.9698219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9698570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9698846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9699203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9700483Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9701753Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.9703033Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9704315Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9705591Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9706914Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9707195Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:23.9707311Z Autotune Choices Stats: 2025-12-04T10:01:23.9708999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9709581Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9709947Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9710585Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9711954Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9713272Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9714601Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9715928Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9717278Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9718662Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9720040Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9721365Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9722693Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9724007Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9728906Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:23.9729083Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9729160Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9729234Z unimplemented [] 2025-12-04T10:01:23.9729348Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9729568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9731063Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9731141Z graph_break [] 2025-12-04T10:01:23.9731285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9731356Z Autotune Choices Stats: 2025-12-04T10:01:23.9733018Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9733375Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9733627Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9733988Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9735296Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9736598Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9737909Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9739194Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9740524Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9741806Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9742097Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:23.9742231Z Autotune Choices Stats: 2025-12-04T10:01:23.9743924Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9744440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9744817Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9745456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9746816Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9748222Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9749558Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9750935Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9752253Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9753679Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9755000Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9756526Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9757853Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9759171Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9759460Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:23.9759604Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9759675Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9759745Z unimplemented [] 2025-12-04T10:01:23.9759853Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9760171Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9761570Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9761633Z graph_break [] 2025-12-04T10:01:23.9761873Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9761939Z Autotune Choices Stats: 2025-12-04T10:01:23.9763599Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9763887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9764132Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9764493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9765797Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9767070Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9768359Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9769634Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9770947Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9772255Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9772604Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:23.9772672Z Autotune Choices Stats: 2025-12-04T10:01:23.9774314Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9774842Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9775211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9775853Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9777200Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9778532Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9779904Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9781231Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9782644Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9783969Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9785288Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9786620Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9788000Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9789337Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9789664Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:23.9789802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9789870Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9789940Z unimplemented [] 2025-12-04T10:01:23.9790056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9790263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9791685Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9791829Z graph_break [] 2025-12-04T10:01:23.9791967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9792035Z Autotune Choices Stats: 2025-12-04T10:01:23.9793643Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9793929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9794173Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9794527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9795820Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9797100Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9798392Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9799712Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9801000Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9802366Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9802656Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:23.9802723Z Autotune Choices Stats: 2025-12-04T10:01:23.9804379Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9804901Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9805267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9805924Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9807267Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9808630Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9809961Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9811317Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9812696Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9814019Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9815344Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9816681Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9818004Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9819360Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9819645Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:23.9819777Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9819843Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9819911Z unimplemented [] 2025-12-04T10:01:23.9820134Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9820338Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9821761Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9821823Z graph_break [] 2025-12-04T10:01:23.9821963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9822028Z Autotune Choices Stats: 2025-12-04T10:01:23.9823625Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9823907Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9824148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9824505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9825793Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9827071Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9828432Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9829711Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9831107Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9832384Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9832666Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:23.9832734Z Autotune Choices Stats: 2025-12-04T10:01:23.9834396Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9834912Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9835284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9835928Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9837265Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9838627Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9839994Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9841383Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9842700Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9844059Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9845375Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9846701Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9848066Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9849388Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9849734Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:23.9849900Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9849968Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9850039Z unimplemented [] 2025-12-04T10:01:23.9850143Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9850348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9851743Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9851807Z graph_break [] 2025-12-04T10:01:23.9851950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9852017Z Autotune Choices Stats: 2025-12-04T10:01:23.9853617Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9853897Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9854143Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9854498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9856059Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9857839Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9859139Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9860473Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9861866Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9863146Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9863433Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:23.9863500Z Autotune Choices Stats: 2025-12-04T10:01:23.9865152Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:23.9865747Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9866185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9866942Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9868473Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9869791Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9871231Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9872548Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9873869Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9875414Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9876769Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9878139Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9879465Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9880829Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9881177Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:23.9881317Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9881386Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9881455Z unimplemented [] 2025-12-04T10:01:23.9881559Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9881764Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9883149Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9883214Z graph_break [] 2025-12-04T10:01:23.9883356Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9883422Z Autotune Choices Stats: 2025-12-04T10:01:23.9885032Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9885321Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9885564Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9885919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9887209Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9888533Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9889847Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9891200Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9892488Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9893766Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9894049Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:23.9894119Z Autotune Choices Stats: 2025-12-04T10:01:23.9895765Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:23.9896279Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9896646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9897324Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9898662Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9900083Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9901422Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9902750Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9904083Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9905411Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9906723Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9908204Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9909618Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9911018Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9911303Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:23.9911441Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9911515Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9911585Z unimplemented [] 2025-12-04T10:01:23.9911691Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9911897Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9913292Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9913351Z graph_break [] 2025-12-04T10:01:23.9913489Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9913559Z Autotune Choices Stats: 2025-12-04T10:01:23.9915175Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9915465Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9915706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9916071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9917401Z 
triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9918684Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9920057Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9921336Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9922632Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9923910Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9924202Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:23.9924268Z Autotune Choices Stats: 2025-12-04T10:01:23.9925920Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9926443Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9926846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9927491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9928881Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9930267Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9931585Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9932906Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9934232Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9935552Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:23.9936910Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9938235Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9939664Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9940988Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9941266Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:23.9941409Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9941486Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9941555Z unimplemented [] 2025-12-04T10:01:23.9941661Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9941862Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9943255Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9943321Z graph_break [] 2025-12-04T10:01:23.9943458Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9943524Z Autotune Choices Stats: 2025-12-04T10:01:23.9945127Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9945413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9945711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9946073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9947429Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9948823Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9950111Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9951392Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9952667Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9953951Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9954237Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:23.9954305Z Autotune Choices Stats: 
2025-12-04T10:01:23.9956222Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:23.9956756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9957124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9957769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9961288Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:23.9964139Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9965543Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9966897Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9968215Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9969555Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9970900Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9972216Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9973710Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9975239Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:23.9975590Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:23.9975751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:23.9975837Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:23.9975903Z unimplemented [] 2025-12-04T10:01:23.9976013Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:23.9976233Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:23.9977625Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:23.9977693Z graph_break [] 2025-12-04T10:01:23.9977833Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:23.9977905Z Autotune Choices Stats: 2025-12-04T10:01:23.9979524Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:23.9979813Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9980060Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9980415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9981706Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9983123Z triton_flex_attention_480 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9984443Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:23.9985735Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:23.9987018Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:23.9988378Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9988665Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:23.9988741Z Autotune Choices Stats: 2025-12-04T10:01:23.9990398Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:23.9990928Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:23.9991373Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:23.9992015Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:23.9993400Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:23.9994765Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:23.9996103Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:23.9997432Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:23.9998749Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0000069Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0001387Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.0002809Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0004164Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0005478Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0005772Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.0005917Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0005993Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0006067Z 
unimplemented [] 2025-12-04T10:01:24.0006176Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0006388Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0007783Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0007854Z graph_break [] 2025-12-04T10:01:24.0007988Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0008055Z Autotune Choices Stats: 2025-12-04T10:01:24.0009653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.0009941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0010182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0010619Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0011946Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0013267Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0014554Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0015833Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0017115Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0018401Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0018683Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.0018750Z Autotune Choices Stats: 2025-12-04T10:01:24.0020394Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.0020985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0021356Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0022024Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0023411Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0024745Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.0026075Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0027477Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0028796Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0030139Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0031527Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0032892Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0034246Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0035560Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0035847Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.0035980Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0036050Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0036118Z unimplemented [] 2025-12-04T10:01:24.0036221Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0036429Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0037823Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0037887Z graph_break [] 2025-12-04T10:01:24.0038024Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0038090Z Autotune Choices Stats: 2025-12-04T10:01:24.0039690Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.0040051Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0040296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0040647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0041979Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0043287Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0044569Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0045837Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0047121Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0048397Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0048678Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.0048744Z Autotune Choices Stats: 2025-12-04T10:01:24.0050383Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.0051024Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0051395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0052070Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0053415Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0054739Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0056262Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0057588Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0058911Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0060368Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0061744Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0063135Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.0064466Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0065791Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0066080Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.0066218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0066286Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0066359Z unimplemented [] 2025-12-04T10:01:24.0066464Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0066679Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0068124Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0068187Z graph_break [] 2025-12-04T10:01:24.0068328Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0068395Z Autotune Choices Stats: 2025-12-04T10:01:24.0070001Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.0070360Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0070644Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0071000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0072346Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0073626Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0074913Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0076202Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0077480Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0078756Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0079104Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.0079172Z Autotune Choices Stats: 2025-12-04T10:01:24.0080857Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.0081376Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0081776Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0082412Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0083751Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0085077Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0086409Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0087743Z 
triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0089063Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0090510Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0091862Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0093190Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0094523Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0095853Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0096143Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.0096277Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.0096346Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0096417Z unimplemented [] 2025-12-04T10:01:24.0096522Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0096726Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0098119Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0098252Z graph_break [] 2025-12-04T10:01:24.0098393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0098462Z Autotune Choices Stats: 2025-12-04T10:01:24.0100103Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0100388Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0100635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0101021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0102310Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0103602Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0104888Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0106164Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0107493Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0108843Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0109137Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.0109204Z Autotune Choices Stats: 2025-12-04T10:01:24.0110915Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.0111439Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0111804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0112445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0113784Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0115115Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0116437Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0117776Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0119207Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0120564Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0121889Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0123206Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0124525Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0125860Z 
triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0126147Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.0126280Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0126349Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0126418Z unimplemented [] 2025-12-04T10:01:24.0126521Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0126723Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0128194Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0128257Z graph_break [] 2025-12-04T10:01:24.0128393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0128459Z Autotune Choices Stats: 2025-12-04T10:01:24.0130192Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0130485Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0130738Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0131096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0132381Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0133664Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.0134951Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0136236Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0137519Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0138895Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0139178Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.0139248Z Autotune Choices Stats: 2025-12-04T10:01:24.0140940Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.0141463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0141828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0142467Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0143806Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0145131Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0146457Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0147895Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0149252Z 
triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0150613Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0151934Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0153262Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0154579Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0156058Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0156353Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.0156492Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0156704Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0156779Z unimplemented [] 2025-12-04T10:01:24.0156888Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0157099Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0158545Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0158610Z graph_break [] 2025-12-04T10:01:24.0158751Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0158827Z Autotune Choices Stats: 2025-12-04T10:01:24.0160491Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0160782Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0161028Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0161387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0162678Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0163958Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0165245Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0166526Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0167909Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0169186Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0169503Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.0169573Z Autotune Choices Stats: 2025-12-04T10:01:24.0171218Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.0171741Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0172111Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0172751Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0174087Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0175411Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0176750Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0178174Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0179541Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0180864Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0182183Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0183513Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0184838Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0186175Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0186529Z SingleProcess AUTOTUNE 
benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.0186665Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0186733Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0186803Z unimplemented [] 2025-12-04T10:01:24.0186915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0187155Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0188626Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0188728Z graph_break [] 2025-12-04T10:01:24.0188878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0188946Z Autotune Choices Stats: 2025-12-04T10:01:24.0190542Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.0190834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0191081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0191435Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0192723Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0194018Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0195291Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0196644Z 
triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0197969Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0199277Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0199567Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.0199632Z Autotune Choices Stats: 2025-12-04T10:01:24.0201276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.0201799Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0202166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0202804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0204146Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0205470Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0206905Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0208253Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0209582Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0210904Z triton_flex_attention_backward_620 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0212232Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0213561Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0214889Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0216282Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0216573Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.0216752Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0216824Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0216894Z unimplemented [] 2025-12-04T10:01:24.0217002Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0217214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0218648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0218711Z graph_break [] 2025-12-04T10:01:24.0218856Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0218924Z Autotune Choices Stats: 2025-12-04T10:01:24.0220542Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0220834Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0221079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0221445Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0222743Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0224037Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0225390Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0226699Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0228078Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0229364Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0229654Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.0229722Z Autotune Choices Stats: 2025-12-04T10:01:24.0231380Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.0231900Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0232275Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0232929Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0234273Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0235693Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0237069Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0238432Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0239754Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0241087Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0242411Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0243739Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0245061Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0246490Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0246775Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.0246922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0246992Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0247090Z unimplemented [] 2025-12-04T10:01:24.0247199Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0247404Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0248800Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0248863Z graph_break [] 2025-12-04T10:01:24.0249001Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0249068Z Autotune Choices Stats: 2025-12-04T10:01:24.0250684Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0250967Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0251211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0251573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0252872Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0254175Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0255803Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0257340Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0258646Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0260258Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0262749Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.0263390Z Autotune Choices Stats: 2025-12-04T10:01:24.0266182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.0269851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0271413Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0273184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0276652Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0279888Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0283250Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0287674Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0292014Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0295306Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0298184Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0300925Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0303844Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0306620Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0308378Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.0308898Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0309203Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0309412Z unimplemented [] 2025-12-04T10:01:24.0309631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0310031Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0311724Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0313258Z graph_break [] 2025-12-04T10:01:24.0313494Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0313797Z Autotune Choices Stats: 2025-12-04T10:01:24.0315537Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:24.0317519Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0318136Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0318830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0320586Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0323366Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0326056Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0328750Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0331400Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0334054Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0335710Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.0336146Z Autotune Choices Stats: 2025-12-04T10:01:24.0337930Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.0340178Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0341151Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0342321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0344431Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0347286Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0350049Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0352786Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0355727Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.0358479Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0361214Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0364069Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0367131Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:24.0370049Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:24.0371774Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices
2025-12-04T10:01:24.0372350Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:24.0372719Z Traceback (most recent call last):
2025-12-04T10:01:24.0373223Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:24.0373738Z self.assertTrue(
2025-12-04T10:01:24.0374087Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:24.0374478Z raise self.failureException(msg)
2025-12-04T10:01:24.0374919Z AssertionError: False is not true : Log file /tmp/tmp49gtu2vp/flex_attention_configs.json was not created
2025-12-04T10:01:24.0375294Z
2025-12-04T10:01:24.0375466Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:24.0376094Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:24.0376545Z
2025-12-04T10:01:24.0376746Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:24.0377155Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:24.0377466Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:24.0377669Z unimplemented []
2025-12-04T10:01:24.0377879Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:24.0379740Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:24.0381438Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:24.0381806Z graph_break []
2025-12-04T10:01:24.0382037Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:24.0383546Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.0384869Z current_size = base.storage().size() 2025-12-04T10:01:24.0385105Z Autotune Choices Stats: 2025-12-04T10:01:24.0386880Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.0388940Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0389595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0390290Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0392042Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0394692Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0397339Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0399977Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0402633Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0405352Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0407041Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.0407483Z Autotune Choices Stats: 2025-12-04T10:01:24.0409272Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.0411518Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0412496Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0413598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0415676Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0418418Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0421165Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0423892Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0426738Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0429581Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0432321Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0435074Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0437802Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0440524Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0442210Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.0442719Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0443022Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0443227Z unimplemented [] 2025-12-04T10:01:24.0443436Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0443911Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0445614Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0447147Z graph_break [] 2025-12-04T10:01:24.0447428Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0447728Z Autotune Choices Stats: 2025-12-04T10:01:24.0449493Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0451473Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0452090Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0452774Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0454513Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0457483Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0460127Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0463930Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0467637Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0470392Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0472070Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.0472514Z Autotune Choices Stats: 2025-12-04T10:01:24.0474354Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0476642Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0477621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0478713Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0480831Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0483598Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0486341Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0489525Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0492298Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0495031Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0497757Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0500498Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0503222Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0505961Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0507735Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.0508350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0508655Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0508864Z unimplemented [] 2025-12-04T10:01:24.0509078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0509477Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0511213Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0512749Z graph_break [] 2025-12-04T10:01:24.0512989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0513285Z Autotune Choices Stats: 2025-12-04T10:01:24.0515056Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0517025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0517645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0518331Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0520079Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0522733Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0525391Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0528042Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0530830Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0533493Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0535145Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:24.0535585Z Autotune Choices Stats: 2025-12-04T10:01:24.0537353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0539596Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0540559Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0541659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0543745Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0546507Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0549370Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0552726Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0556999Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0561285Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0565117Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0568355Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0572444Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0576550Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0579293Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.0580091Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0580574Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0580898Z unimplemented [] 2025-12-04T10:01:24.0581224Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0581948Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0583748Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0585292Z graph_break [] 2025-12-04T10:01:24.0585542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0585850Z Autotune Choices Stats: 2025-12-04T10:01:24.0587667Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0589636Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0590254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0590940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0592690Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0595366Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0598038Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0600791Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0603474Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.0606179Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0607842Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.0608289Z Autotune Choices Stats: 2025-12-04T10:01:24.0610063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0612318Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0613296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0614402Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0616495Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0619273Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0622172Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0624960Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0627775Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0630517Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0633255Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:24.0635990Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0638725Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0641532Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0643259Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.0643773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0644074Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0644286Z unimplemented [] 
2025-12-04T10:01:24.0644503Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0644899Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0646640Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0648172Z graph_break [] 2025-12-04T10:01:24.0648405Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0648704Z Autotune Choices Stats: 2025-12-04T10:01:24.0650432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0652389Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0653015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0653706Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0655757Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0658470Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0661313Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0664020Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0666740Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0669466Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0671114Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.0671550Z Autotune Choices Stats: 2025-12-04T10:01:24.0673316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0675569Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0676548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0677646Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0679734Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0682602Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0685386Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0688133Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0690873Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0693610Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0696346Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0699093Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0701893Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0704653Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0706355Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.0706923Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0707308Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0707514Z unimplemented [] 2025-12-04T10:01:24.0707728Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0708129Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0709815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0711349Z graph_break [] 2025-12-04T10:01:24.0711581Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0711893Z Autotune Choices Stats: 2025-12-04T10:01:24.0713614Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0715579Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0716189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0716879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0718629Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0721285Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0724043Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0726725Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0729365Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0732004Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0733665Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.0734101Z Autotune Choices Stats: 2025-12-04T10:01:24.0735877Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0738132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0739094Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0740193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0742350Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0745123Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0747971Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0750711Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0753450Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0756335Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0759073Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0761839Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0764808Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0767606Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0769311Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.0769822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0770132Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0770341Z unimplemented [] 2025-12-04T10:01:24.0770549Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0770953Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0772648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0774187Z graph_break [] 2025-12-04T10:01:24.0774417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0774723Z Autotune Choices Stats: 2025-12-04T10:01:24.0776465Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0778443Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0779058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0779743Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0781554Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.0784357Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0787040Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0789779Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0792443Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0795094Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0796749Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.0797199Z Autotune Choices Stats: 2025-12-04T10:01:24.0798971Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.0801223Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0802185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:24.0803363Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0805477Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0808274Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0811039Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0813799Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0816559Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0819303Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0822042Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0824844Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0827674Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0830468Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0832170Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.0832691Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:24.0832993Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0833201Z unimplemented [] 2025-12-04T10:01:24.0833415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0833812Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0835510Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0837065Z graph_break [] 2025-12-04T10:01:24.0843891Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0844264Z Autotune Choices Stats: 2025-12-04T10:01:24.0846029Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.0848011Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0848626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0849463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0851221Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0853899Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0856852Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0859492Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0862127Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0864752Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0866390Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.0866826Z Autotune Choices Stats: 2025-12-04T10:01:24.0868681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0871028Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0872002Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0873135Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0875234Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0877960Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0880667Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0883722Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0886449Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0889156Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0891961Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0894835Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0898195Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0900928Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0902616Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.0903145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0903462Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0903784Z unimplemented [] 2025-12-04T10:01:24.0904007Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0904557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0906320Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0907919Z graph_break [] 2025-12-04T10:01:24.0908164Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0908472Z Autotune Choices Stats: 2025-12-04T10:01:24.0910197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.0912259Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0912873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0913555Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0915321Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0917987Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0920616Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0923243Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0925880Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0928539Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0930171Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.0930618Z Autotune Choices Stats: 2025-12-04T10:01:24.0932378Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0934710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0935712Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0936807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0938909Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0941646Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0944368Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0947088Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0949875Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0952587Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.0955605Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.0958403Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.0961129Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0963845Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.0965519Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.0966039Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.0966344Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.0966549Z unimplemented [] 2025-12-04T10:01:24.0966759Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.0967163Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.0968851Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.0970390Z graph_break [] 2025-12-04T10:01:24.0970627Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.0970930Z Autotune Choices Stats: 2025-12-04T10:01:24.0972657Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.0974723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0975368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0976050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.0977872Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0980505Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0983148Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.0985858Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.0988596Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0991222Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.0992951Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.0993390Z Autotune Choices Stats: 2025-12-04T10:01:24.0995214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.0997459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.0998576Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.0999666Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1001743Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1006032Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1010323Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1014617Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1018897Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1023326Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1027764Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1032055Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1036339Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1040613Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1043262Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:24.1044088Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1044576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1044917Z unimplemented [] 2025-12-04T10:01:24.1045269Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1045901Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1048544Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1051028Z graph_break [] 2025-12-04T10:01:24.1051421Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1051907Z Autotune Choices Stats: 2025-12-04T10:01:24.1054665Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.1057811Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1058648Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1059699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1062381Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1066422Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1070438Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1074632Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1078803Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1082968Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1085760Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.1086464Z Autotune Choices Stats: 2025-12-04T10:01:24.1089304Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1092851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1094381Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1096129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1099405Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1103723Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1108161Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1112460Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1116849Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1121199Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1125542Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1129845Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1134144Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:24.1138437Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1141102Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.1141929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1142431Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1142767Z unimplemented [] 2025-12-04T10:01:24.1143126Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1143770Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1146410Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1148984Z graph_break [] 2025-12-04T10:01:24.1149375Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1149865Z Autotune Choices Stats: 2025-12-04T10:01:24.1152610Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1155909Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1156889Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1157972Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1160706Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1164876Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1169024Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1173187Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1177344Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1181706Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1184296Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.1184997Z Autotune Choices Stats: 2025-12-04T10:01:24.1187897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1191384Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1192901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1194652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1197914Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1202211Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1206545Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1210849Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.1215279Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1219625Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1223923Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1228305Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1232598Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1236886Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1239545Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.1240367Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1240863Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1241323Z unimplemented [] 2025-12-04T10:01:24.1241668Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1242005Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1244221Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1244336Z graph_break [] 2025-12-04T10:01:24.1244565Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1244690Z Autotune Choices Stats: 2025-12-04T10:01:24.1247244Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.1247704Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1248096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1248663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1250706Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1252718Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1254735Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1256864Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1259053Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1261153Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1261674Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.1261798Z Autotune Choices Stats: 2025-12-04T10:01:24.1264373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.1265175Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:24.1265769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1266797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1268954Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1271055Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1273140Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1275371Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1277514Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1279604Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1281696Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1283786Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1285876Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1287957Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:24.1288499Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.1288736Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1288861Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1288972Z unimplemented [] 2025-12-04T10:01:24.1289148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1289482Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1291688Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1291805Z graph_break [] 2025-12-04T10:01:24.1292078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1292203Z Autotune Choices Stats: 2025-12-04T10:01:24.1294709Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1295158Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1295551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1296120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1298158Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1300171Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1302199Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1304331Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1306375Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1308498Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1308946Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.1309067Z Autotune Choices Stats: 2025-12-04T10:01:24.1311648Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1312451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1313035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1314056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1316158Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1318257Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1320435Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1322575Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1324704Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.1326800Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1328896Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1330977Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1333070Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1335151Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1335688Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.1335912Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1336080Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1336192Z unimplemented [] 2025-12-04T10:01:24.1336365Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1336699Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1338917Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1339032Z graph_break [] 2025-12-04T10:01:24.1339259Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:24.1339372Z Autotune Choices Stats: 2025-12-04T10:01:24.1341889Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1342335Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1342721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1343281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1345314Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1347404Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1349370Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1351708Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1353876Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1356073Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1356538Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.1356673Z Autotune Choices Stats: 2025-12-04T10:01:24.1359146Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.1359903Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1360454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1361465Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:24.1363571Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1365897Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1368008Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1370128Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1372175Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1374207Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1376232Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1378255Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1380282Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1382441Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1382893Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.1383135Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1383263Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1383373Z unimplemented [] 2025-12-04T10:01:24.1383541Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1383914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1386019Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1386129Z graph_break [] 2025-12-04T10:01:24.1386349Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1386466Z Autotune Choices Stats: 2025-12-04T10:01:24.1388976Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1389406Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1389804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1390348Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1392300Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1394244Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1396418Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1398437Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:24.1400501Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1402526Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1402970Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.1403091Z Autotune Choices Stats: 2025-12-04T10:01:24.1405660Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1406509Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1407096Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1408120Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1410219Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1412502Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1414637Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1416746Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1418838Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1420932Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.1423025Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1425109Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1427368Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1429508Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1430017Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.1430255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1430379Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1430490Z unimplemented [] 2025-12-04T10:01:24.1430669Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1431006Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1433182Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1433300Z graph_break [] 2025-12-04T10:01:24.1433527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1433643Z Autotune Choices Stats: 2025-12-04T10:01:24.1436161Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1436606Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1437005Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1437575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1439610Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1441718Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1443788Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1445859Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1447873Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1449912Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1450354Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.1450475Z Autotune Choices Stats: 
2025-12-04T10:01:24.1453062Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1453865Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1454461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1455769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1457955Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.1460114Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1462212Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1464305Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1466402Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1468597Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1470692Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1472946Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1475089Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1477223Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1477674Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.1477904Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1478030Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1478139Z unimplemented [] 2025-12-04T10:01:24.1478314Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1478657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1480839Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1480952Z graph_break [] 2025-12-04T10:01:24.1481179Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1481293Z Autotune Choices Stats: 2025-12-04T10:01:24.1483799Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1484235Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1484634Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1485202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1487357Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1489414Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1491490Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1493512Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1495531Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.1497560Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1498010Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.1498134Z Autotune Choices Stats: 2025-12-04T10:01:24.1500710Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1501502Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1502185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1503205Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1505356Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1507600Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1509713Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1511825Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1513911Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1516015Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1518106Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.1520325Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1522463Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1524559Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1525014Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.1525246Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1525369Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1525480Z 
unimplemented [] 2025-12-04T10:01:24.1525656Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1525993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1528157Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1528271Z graph_break [] 2025-12-04T10:01:24.1528499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1528614Z Autotune Choices Stats: 2025-12-04T10:01:24.1531127Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1531571Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1532072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1532640Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1534722Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1536771Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1538803Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1540811Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1542842Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1544877Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1545322Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.1545436Z Autotune Choices Stats: 2025-12-04T10:01:24.1548071Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.1548975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1550371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1551415Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1553585Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1555836Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.1557932Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1560027Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1562121Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1564205Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1566503Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1568648Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1570744Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1572821Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1573280Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.1573509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1573635Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1573745Z unimplemented [] 2025-12-04T10:01:24.1573924Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1574261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1576434Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1576548Z graph_break [] 2025-12-04T10:01:24.1576774Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1576890Z Autotune Choices Stats: 2025-12-04T10:01:24.1579406Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1579937Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1580332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1580902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1582990Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1585076Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1587098Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1589188Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1591214Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1593246Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1593688Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.1593799Z Autotune Choices Stats: 2025-12-04T10:01:24.1596541Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1597380Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1597965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1599031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1601137Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1603232Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1605324Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1607419Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1609500Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1611676Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1613817Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1615953Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.1618044Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1620138Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1620594Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.1620822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1620952Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1621058Z unimplemented [] 2025-12-04T10:01:24.1621233Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1621571Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1623736Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1623845Z graph_break [] 2025-12-04T10:01:24.1624071Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1624185Z Autotune Choices Stats: 2025-12-04T10:01:24.1626790Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1627337Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1627736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1628299Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1630376Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1632382Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1634404Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1636419Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1638446Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1640464Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1640991Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.1641105Z Autotune Choices Stats: 2025-12-04T10:01:24.1643747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.1644542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1645179Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1646213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1648324Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1650433Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1652527Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1654632Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1656863Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1659136Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1661288Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1663379Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1665424Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1667626Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1668081Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.1668317Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.1668440Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1668548Z unimplemented [] 2025-12-04T10:01:24.1668724Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1669060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1671231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1671459Z graph_break [] 2025-12-04T10:01:24.1671692Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1671822Z Autotune Choices Stats: 2025-12-04T10:01:24.1674372Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1674822Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1675248Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1675830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1677859Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1679885Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1681909Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1683939Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1685956Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1688066Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1688509Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.1688671Z Autotune Choices Stats: 2025-12-04T10:01:24.1691291Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.1692092Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1692680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1693715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1695828Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1697917Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1700007Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1702102Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1704348Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1706483Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1708708Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1710788Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1712884Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1714978Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1715417Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.1715657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1715775Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1715885Z unimplemented [] 2025-12-04T10:01:24.1716065Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1716394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1718650Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1718761Z graph_break [] 2025-12-04T10:01:24.1718987Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1719151Z Autotune Choices Stats: 2025-12-04T10:01:24.1721697Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1722151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1722534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1723109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1725139Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1727156Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.1729172Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1731202Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1733225Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1735378Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1735825Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.1735949Z Autotune Choices Stats: 2025-12-04T10:01:24.1738568Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.1739368Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1739968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1740997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1743094Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1745205Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1747355Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1749549Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1751680Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1753816Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1756055Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1758148Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1760241Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1762330Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1762778Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.1763178Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1763305Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1763414Z unimplemented [] 2025-12-04T10:01:24.1763599Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1763927Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1766156Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1766271Z graph_break [] 2025-12-04T10:01:24.1766507Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1766633Z Autotune Choices Stats: 2025-12-04T10:01:24.1769203Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1769652Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1770039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1770616Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1772652Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1774675Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1776702Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1778728Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1780878Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1782940Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1783388Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.1783509Z Autotune Choices Stats: 2025-12-04T10:01:24.1786092Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1786896Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1787576Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1788608Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1790720Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1792823Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1794998Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1797131Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1799266Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1801358Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1803453Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1805530Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1807621Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1809707Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1810241Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.1810466Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1810588Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1810699Z unimplemented [] 2025-12-04T10:01:24.1810879Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1811250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1813475Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1813589Z graph_break [] 2025-12-04T10:01:24.1813812Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1813933Z Autotune Choices Stats: 2025-12-04T10:01:24.1816439Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1816885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1817270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1817844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1819877Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1821894Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1823917Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1826039Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1828174Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1830238Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1830683Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.1830808Z Autotune Choices Stats: 2025-12-04T10:01:24.1833389Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.1834195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1834781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1835805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1837933Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1840038Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1842333Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1844516Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1846580Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1848744Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1850884Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1852920Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1855070Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1857572Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1858157Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.1858414Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1858536Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1858647Z unimplemented [] 2025-12-04T10:01:24.1858844Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1859124Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1861437Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1861563Z graph_break [] 2025-12-04T10:01:24.1861803Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1861934Z Autotune Choices Stats: 2025-12-04T10:01:24.1864313Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.1864795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1865193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1865788Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1867844Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1869940Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1872138Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1874307Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1876467Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1878550Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1879030Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.1879158Z Autotune Choices Stats: 2025-12-04T10:01:24.1881715Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.1882521Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1883111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1884125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1886241Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1888510Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1890645Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1892744Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1894900Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1896988Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1899086Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1901191Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1903282Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1905524Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.1905980Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.1906219Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1906386Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1906499Z unimplemented [] 2025-12-04T10:01:24.1906682Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1907009Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1909265Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1909375Z graph_break [] 2025-12-04T10:01:24.1909605Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1909727Z Autotune Choices Stats: 2025-12-04T10:01:24.1912244Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.1912691Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1913086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1913668Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1915699Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1917720Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1919877Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1922112Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1924147Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1926182Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1926646Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.1926762Z Autotune Choices Stats: 2025-12-04T10:01:24.1929353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.1930159Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1930747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1931775Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1934041Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1936200Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1938346Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1940431Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1942540Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1944638Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1946746Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1948907Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1951124Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1953259Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1953709Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.1953945Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.1954061Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.1954170Z unimplemented [] 2025-12-04T10:01:24.1954358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.1954690Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.1957019Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.1957131Z graph_break [] 2025-12-04T10:01:24.1957356Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.1957485Z Autotune Choices Stats: 2025-12-04T10:01:24.1959987Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.1960442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1960826Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1961401Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1963425Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1965654Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1967749Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.1969780Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1971803Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1973846Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.1974290Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.1974408Z Autotune Choices Stats: 2025-12-04T10:01:24.1976999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.1977800Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.1978392Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.1979518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.1981662Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1983819Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1985923Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1988096Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1990189Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.1992280Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.1994382Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.1996616Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.1998749Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2000890Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2001332Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.2001565Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2001686Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2001794Z unimplemented [] 2025-12-04T10:01:24.2001977Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2002308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2004480Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2004585Z graph_break [] 2025-12-04T10:01:24.2004808Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2004933Z Autotune Choices Stats: 2025-12-04T10:01:24.2007441Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.2007887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2008273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2008938Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2010973Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2013029Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2015097Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2016514Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2017799Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2019095Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2019382Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.2019449Z Autotune Choices Stats: 2025-12-04T10:01:24.2021110Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.2021705Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2022073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2022753Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2024138Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2025483Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2026818Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2028268Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2029601Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2030927Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2032327Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2033684Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2035057Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2036396Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2036685Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.2036831Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2036900Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2036963Z unimplemented [] 2025-12-04T10:01:24.2037074Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2037278Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2038670Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2038734Z graph_break [] 2025-12-04T10:01:24.2038870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2038944Z Autotune Choices Stats: 2025-12-04T10:01:24.2040555Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2040929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2041172Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2041527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2042856Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2044185Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2045466Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2046763Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2048051Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2049345Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2049622Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.2049698Z Autotune Choices Stats: 2025-12-04T10:01:24.2051343Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.2051935Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2052328Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2052980Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2054387Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2056003Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2057356Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2058687Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2060026Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2061348Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2062843Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2064209Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2065555Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2066895Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2067175Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.2067404Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2067476Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2067538Z unimplemented [] 2025-12-04T10:01:24.2067663Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2067868Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2069264Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2069326Z graph_break [] 2025-12-04T10:01:24.2069461Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2069539Z Autotune Choices Stats: 2025-12-04T10:01:24.2071138Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2071508Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2071752Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2072190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2073526Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2074823Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2076107Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2077406Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2078685Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2079973Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2080317Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.2080391Z 
Autotune Choices Stats: 2025-12-04T10:01:24.2082078Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.2082601Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2082996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2083820Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2085538Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.2086879Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2088206Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2089543Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2090867Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2092306Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2093670Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2094994Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2096445Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2097787Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2098068Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.2098217Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2098286Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2098349Z unimplemented [] 2025-12-04T10:01:24.2098463Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2098669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2100061Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2100211Z graph_break [] 2025-12-04T10:01:24.2100344Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2100415Z Autotune Choices Stats: 2025-12-04T10:01:24.2102060Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2102354Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2102594Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2102986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2104281Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2105572Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2106862Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2108219Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2109513Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.2110800Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2111168Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.2111243Z Autotune Choices Stats: 2025-12-04T10:01:24.2112915Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.2113469Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2113831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2114481Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2115820Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2117166Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2118500Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2119839Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2121235Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2122593Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2123976Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.2125306Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2126635Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2127967Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2128249Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.2128389Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2128458Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2128519Z 
unimplemented [] 2025-12-04T10:01:24.2128631Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2128836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2130220Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2130360Z graph_break [] 2025-12-04T10:01:24.2130496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2130568Z Autotune Choices Stats: 2025-12-04T10:01:24.2132197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.2132526Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2132770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2133123Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2134420Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2135710Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2136994Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2138289Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2139573Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2140928Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2141237Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.2141310Z Autotune Choices Stats: 2025-12-04T10:01:24.2142985Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.2143513Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2143873Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2144526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2145865Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2147284Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.2148622Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2149951Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2151399Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2152753Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2154082Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2155628Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2156970Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2158303Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2158586Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.2158722Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2158800Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2158995Z unimplemented [] 2025-12-04T10:01:24.2159107Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2159308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2160717Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2160831Z graph_break [] 2025-12-04T10:01:24.2160969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2161042Z Autotune Choices Stats: 2025-12-04T10:01:24.2162695Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2162988Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2163235Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2163667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2165291Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2166587Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2167871Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2169158Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2170551Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2171881Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2172167Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.2172272Z Autotune Choices Stats: 2025-12-04T10:01:24.2173946Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.2174468Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2174832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2175481Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2176827Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2178172Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2179508Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2180934Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2182312Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2183643Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2184976Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2186315Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.2187739Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2189066Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2189418Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.2189562Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2189633Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2189697Z unimplemented [] 2025-12-04T10:01:24.2189812Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2190016Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2191450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2191517Z graph_break [] 2025-12-04T10:01:24.2191652Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2191757Z Autotune Choices Stats: 2025-12-04T10:01:24.2193362Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2193653Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2193895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2194259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2195555Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2196921Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2198262Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2199554Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2200943Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2202268Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2202555Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.2202625Z Autotune Choices Stats: 2025-12-04T10:01:24.2204280Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.2204802Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2205165Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2205826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2207173Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2208504Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2209915Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2211311Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2212675Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2214002Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2215338Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2216680Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2218009Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2219344Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2219696Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.2219841Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.2219915Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2220029Z unimplemented [] 2025-12-04T10:01:24.2220144Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2220349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2221789Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2221851Z graph_break [] 2025-12-04T10:01:24.2221987Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2222061Z Autotune Choices Stats: 2025-12-04T10:01:24.2223661Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2223955Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2224193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2224553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2225863Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2227160Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2228499Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2230293Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2231639Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2232936Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2233217Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.2233294Z Autotune Choices Stats: 2025-12-04T10:01:24.2234954Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.2235484Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2235843Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2236495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2237838Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2239242Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2240601Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2241962Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2243305Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2244644Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2245973Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2247312Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2248639Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2250069Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2250350Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.2250491Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2250565Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2250627Z unimplemented [] 2025-12-04T10:01:24.2250735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2250982Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2252379Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2252440Z graph_break [] 2025-12-04T10:01:24.2252572Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2252648Z Autotune Choices Stats: 2025-12-04T10:01:24.2254252Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.2254547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2254782Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2255147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2256761Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2258055Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.2259495Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2260836Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2262178Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2263468Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2263749Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.2263823Z Autotune Choices Stats: 2025-12-04T10:01:24.2265465Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.2266003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2266367Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2267013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2268427Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2269875Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2271247Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2272948Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2274451Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2275780Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2277134Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2278863Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2281176Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2283378Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2283842Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.2284188Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:24.2284329Z Traceback (most recent call last): 2025-12-04T10:01:24.2284876Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:24.2285000Z self.assertTrue( 2025-12-04T10:01:24.2285344Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:24.2285483Z raise self.failureException(msg) 2025-12-04T10:01:24.2285907Z AssertionError: False is not true : Log 
file /tmp/tmpvow2h57n/flex_attention_configs.json was not created 2025-12-04T10:01:24.2285919Z 2025-12-04T10:01:24.2286138Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:24.2286623Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:24.2286630Z 2025-12-04T10:01:24.2286895Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:24.2287131Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2287247Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2287333Z unimplemented [] 2025-12-04T10:01:24.2287500Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2289532Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:24.2289825Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2289935Z graph_break [] 2025-12-04T10:01:24.2290167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2291981Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.2292131Z current_size = base.storage().size() 2025-12-04T10:01:24.2292247Z Autotune Choices Stats: 2025-12-04T10:01:24.2294962Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.2295439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2295901Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2296425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2298383Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2300256Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2302161Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2304060Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2305959Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2308056Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2308670Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.2308801Z Autotune Choices Stats: 2025-12-04T10:01:24.2311477Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.2312261Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2312690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2313339Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2314697Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2316046Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2317378Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2318711Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2320038Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2321503Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2322873Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2324200Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2325536Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2326868Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2327165Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.2327307Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2327387Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2327452Z unimplemented [] 2025-12-04T10:01:24.2327560Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2327772Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2329175Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2329315Z graph_break [] 2025-12-04T10:01:24.2329451Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2329522Z Autotune Choices Stats: 2025-12-04T10:01:24.2331173Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2331463Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2331716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2332107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2333423Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2334708Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2335999Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2337303Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2338586Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2339938Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2340221Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.2340296Z Autotune Choices Stats: 2025-12-04T10:01:24.2342030Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2342562Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2342926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2343574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2344919Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2346250Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2347677Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2349010Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2350445Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2351818Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2365860Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2367282Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2368636Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2369959Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2370271Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.2370422Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2370504Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2370573Z unimplemented [] 2025-12-04T10:01:24.2370686Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2370907Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2372489Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2372561Z graph_break [] 2025-12-04T10:01:24.2372705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2372777Z Autotune Choices Stats: 2025-12-04T10:01:24.2374479Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2374779Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2375035Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2375392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2376728Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2378010Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2379294Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2380570Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2381837Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2383217Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2383502Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:24.2383576Z Autotune Choices Stats: 2025-12-04T10:01:24.2385258Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2385795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2386157Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2386790Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2388232Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2389566Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2390891Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2392289Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2393834Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2395219Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2396543Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2397857Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2399173Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2400487Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2400778Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.2400922Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2401068Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2401134Z unimplemented [] 2025-12-04T10:01:24.2401244Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2401464Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2402919Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2402987Z graph_break [] 2025-12-04T10:01:24.2403126Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2403199Z Autotune Choices Stats: 2025-12-04T10:01:24.2404840Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2405131Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2405380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2405741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2407045Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2408315Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2409596Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2410875Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2412210Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.2413522Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2413851Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.2413924Z Autotune Choices Stats: 2025-12-04T10:01:24.2415569Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2416090Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2416458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2417100Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2418434Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2419763Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2421086Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2422514Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2423859Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2425189Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2426513Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:24.2427910Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2429233Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2430545Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2430906Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.2431047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2431124Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2431189Z unimplemented [] 
2025-12-04T10:01:24.2431295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2431508Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2432925Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2433004Z graph_break [] 2025-12-04T10:01:24.2433180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2433249Z Autotune Choices Stats: 2025-12-04T10:01:24.2434850Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2435142Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2435394Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2435756Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2437056Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2438332Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2439612Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2440965Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2442267Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2443580Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2443863Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.2443935Z Autotune Choices Stats: 2025-12-04T10:01:24.2445577Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2446105Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2446482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2447119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2448456Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2449782Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2451207Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2452563Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2453878Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2455392Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2456734Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2458047Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2459370Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2460799Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2461086Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.2461270Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2461342Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2461416Z unimplemented [] 2025-12-04T10:01:24.2461523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2461734Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2463175Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2463249Z graph_break [] 2025-12-04T10:01:24.2463389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2463460Z Autotune Choices Stats: 2025-12-04T10:01:24.2465054Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2465342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2465589Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2465940Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2467323Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2468608Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2469882Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2471255Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2472562Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2473841Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2474128Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.2474200Z Autotune Choices Stats: 2025-12-04T10:01:24.2475862Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2476382Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2476754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2477389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2478732Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2480142Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2481509Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2482899Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2484233Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2485564Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2486894Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2488221Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2489544Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2490966Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2491260Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.2491400Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2491471Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2491543Z unimplemented [] 2025-12-04T10:01:24.2491686Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2491903Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2493297Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2493360Z graph_break [] 2025-12-04T10:01:24.2493504Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2493572Z Autotune Choices Stats: 2025-12-04T10:01:24.2495352Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2495644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2495898Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2496251Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2497549Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.2498815Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2500199Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2501511Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2502794Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2504076Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2504364Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.2504432Z Autotune Choices Stats: 2025-12-04T10:01:24.2506093Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.2506616Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2506988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:24.2507718Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2509057Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2510484Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2511868Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2513199Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2514521Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2515846Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2517173Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2518499Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2519890Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2521241Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2521566Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.2521706Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:24.2521778Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2521851Z unimplemented [] 2025-12-04T10:01:24.2521958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2522170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2523565Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2523630Z graph_break [] 2025-12-04T10:01:24.2523771Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2523838Z Autotune Choices Stats: 2025-12-04T10:01:24.2525449Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.2525743Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2525989Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2526344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2527639Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2528987Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2530302Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2531608Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2532887Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2534168Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2534452Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.2534519Z Autotune Choices Stats: 2025-12-04T10:01:24.2536167Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2536687Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2537050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2537750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2539126Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2540477Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2541817Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2543141Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2544458Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2545784Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2547106Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2548559Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2549936Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2551286Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2551573Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.2551707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2551775Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2551845Z unimplemented [] 2025-12-04T10:01:24.2551951Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2552154Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2553551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2553614Z graph_break [] 2025-12-04T10:01:24.2553755Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2553822Z Autotune Choices Stats: 2025-12-04T10:01:24.2555590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.2555889Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2556134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2556496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2557899Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2559214Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2560552Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2561829Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2563114Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2564389Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2564675Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.2564747Z Autotune Choices Stats: 2025-12-04T10:01:24.2566396Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2566924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2567363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2568000Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2569381Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2570747Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2572078Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2573415Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2575053Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2577011Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2579033Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2581245Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2583310Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2585324Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2585782Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.2586011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2586140Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2586232Z unimplemented [] 2025-12-04T10:01:24.2586355Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2586567Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2588058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2588128Z graph_break [] 2025-12-04T10:01:24.2588274Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2588347Z Autotune Choices Stats: 2025-12-04T10:01:24.2589960Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2590250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2590641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2591003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2592336Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2593650Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2594932Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2596220Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2597497Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2598772Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2599063Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.2599133Z Autotune Choices Stats: 2025-12-04T10:01:24.2600785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2601375Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2601773Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2602411Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2603784Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2605109Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2606451Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2607773Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2609085Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2610412Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2611818Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2613179Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2614502Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2615909Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2616247Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:24.2616411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2616502Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2616586Z unimplemented [] 2025-12-04T10:01:24.2616723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2616967Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2618519Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2618584Z graph_break [] 2025-12-04T10:01:24.2618724Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2618792Z Autotune Choices Stats: 2025-12-04T10:01:24.2620382Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.2620766Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2621010Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2621395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2622717Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2623992Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2625282Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2626779Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2628119Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2629401Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2629686Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.2629844Z Autotune Choices Stats: 2025-12-04T10:01:24.2631484Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2632037Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2632403Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2633077Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2634434Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2635770Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2637102Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2638422Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2639754Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2641142Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2642503Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2643861Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2645186Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:24.2646515Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2646800Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.2646938Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2647014Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2647085Z unimplemented [] 2025-12-04T10:01:24.2647194Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2647400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2648801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2648865Z graph_break [] 2025-12-04T10:01:24.2649004Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2649145Z Autotune Choices Stats: 2025-12-04T10:01:24.2650744Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2651069Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2651311Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2651672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2652995Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2654271Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2655852Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2657142Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2658425Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2659706Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2660134Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.2660206Z Autotune Choices Stats: 2025-12-04T10:01:24.2661897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2662487Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2662856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2663502Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2664845Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2666169Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2667563Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2668892Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.2670288Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2671649Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2673002Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2674330Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2675650Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2676984Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2677277Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.2677413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2677485Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2677554Z unimplemented [] 2025-12-04T10:01:24.2677659Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2677867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2679261Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2679391Z graph_break [] 2025-12-04T10:01:24.2679542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2679612Z Autotune Choices Stats: 2025-12-04T10:01:24.2681250Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.2681539Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2681810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2682172Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2683458Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2684741Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2686029Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2687315Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2688600Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2689946Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2690266Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.2690337Z Autotune Choices Stats: 2025-12-04T10:01:24.2692448Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.2693220Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:24.2693780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2694719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2696777Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2698898Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2700911Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2702839Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2704625Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2706242Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2708090Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2710013Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2711934Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2713394Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:24.2713840Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.2714070Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2714187Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2714294Z unimplemented [] 2025-12-04T10:01:24.2714409Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2714710Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2716414Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2716485Z graph_break [] 2025-12-04T10:01:24.2716682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2716759Z Autotune Choices Stats: 2025-12-04T10:01:24.2718717Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2719131Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2719504Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2720014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2721540Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2723070Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2724662Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2726287Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2727965Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2729644Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2729972Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.2730047Z Autotune Choices Stats: 2025-12-04T10:01:24.2732271Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2733027Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2733562Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2734202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2735565Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2736899Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2738227Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2739617Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2740985Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.2742353Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2743670Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2744994Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2746310Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2747744Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2748030Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.2748268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2748342Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2748425Z unimplemented [] 2025-12-04T10:01:24.2748539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2748747Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2750184Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2750250Z graph_break [] 2025-12-04T10:01:24.2750391Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:24.2750459Z Autotune Choices Stats: 2025-12-04T10:01:24.2752112Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2752406Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2752649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2753013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2754305Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2755972Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2757289Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2758587Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2760080Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2761405Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2761699Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.2761769Z Autotune Choices Stats: 2025-12-04T10:01:24.2763428Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.2763956Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2764326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2764968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:24.2766324Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2767658Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2769057Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2770415Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2771767Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2773093Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2774414Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2775746Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2777072Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2778397Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2778743Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.2778889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2778961Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2779031Z unimplemented [] 2025-12-04T10:01:24.2779141Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2779808Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2781248Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2781316Z graph_break [] 2025-12-04T10:01:24.2781475Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2781544Z Autotune Choices Stats: 2025-12-04T10:01:24.2783143Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2783439Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2783682Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2784045Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2785335Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2786615Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2787970Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2789360Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:24.2790674Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2791983Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2792273Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.2792342Z Autotune Choices Stats: 2025-12-04T10:01:24.2793994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2794520Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2794890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2795529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2796875Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2798193Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2799625Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2800980Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2802307Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2803642Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.2804955Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2806290Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2807601Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2808991Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2809305Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.2809450Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2809521Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2809594Z unimplemented [] 2025-12-04T10:01:24.2809701Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2809910Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2811332Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2811395Z graph_break [] 2025-12-04T10:01:24.2811534Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2811602Z Autotune Choices Stats: 2025-12-04T10:01:24.2813192Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2813485Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2813727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2814097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2815388Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2816681Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2818021Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2819336Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2820676Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2821961Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2822245Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.2822313Z Autotune Choices Stats: 
2025-12-04T10:01:24.2823959Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2824478Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2824844Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2825474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2826814Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.2828313Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2829681Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2831009Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2832333Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2833661Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2834983Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2836304Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2837689Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2839041Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2839327Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.2839509Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2839580Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2839649Z unimplemented [] 2025-12-04T10:01:24.2839755Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2839962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2841362Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2841427Z graph_break [] 2025-12-04T10:01:24.2841571Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2841642Z Autotune Choices Stats: 2025-12-04T10:01:24.2843246Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2843538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2843779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2844136Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2845425Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2846710Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2848087Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2849399Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2850686Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.2851962Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2852249Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.2852320Z Autotune Choices Stats: 2025-12-04T10:01:24.2853984Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2854508Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2854880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2855884Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2857382Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2858775Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2860160Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2861484Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2862818Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2864147Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2865481Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.2866805Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2868325Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2869703Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2869992Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.2870140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2870215Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2870279Z 
unimplemented [] 2025-12-04T10:01:24.2870392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2870607Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2872016Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2872081Z graph_break [] 2025-12-04T10:01:24.2872224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2872294Z Autotune Choices Stats: 2025-12-04T10:01:24.2873904Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2874202Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2874445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2874811Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2876109Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2877489Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2878802Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2880103Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2881387Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2882670Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2882953Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.2883027Z Autotune Choices Stats: 2025-12-04T10:01:24.2884673Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.2885195Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2885565Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2886277Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2887648Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2889008Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.2890339Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2891667Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2892988Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2894312Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2895643Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2897101Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2898514Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2899942Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2900228Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.2900377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2900451Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2900515Z unimplemented [] 2025-12-04T10:01:24.2900626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2900834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2902223Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2902285Z graph_break [] 2025-12-04T10:01:24.2902429Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2902500Z Autotune Choices Stats: 2025-12-04T10:01:24.2904099Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2904390Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2904632Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2905071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2906554Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2908084Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2909410Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2910689Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2911977Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2913261Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2913552Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.2913621Z Autotune Choices Stats: 2025-12-04T10:01:24.2915270Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.2915878Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2916249Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2916930Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2918422Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2919759Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2921095Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2922423Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2923766Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2925089Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2926486Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2927847Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.2929202Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2930526Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2930814Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.2930961Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2931033Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2931099Z unimplemented [] 2025-12-04T10:01:24.2931212Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2931420Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2932822Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2932888Z graph_break [] 2025-12-04T10:01:24.2933033Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2933111Z Autotune Choices Stats: 2025-12-04T10:01:24.2934713Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2935104Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2935348Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2935717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2937048Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2938368Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2939652Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2940945Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2942231Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2943516Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2943802Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.2943871Z Autotune Choices Stats: 2025-12-04T10:01:24.2945518Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.2946106Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2946508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2947154Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2948585Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2949913Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2951251Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2952575Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2953903Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2955428Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2956949Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2958332Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2959664Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2961000Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2961285Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.2961442Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.2961517Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2961588Z unimplemented [] 2025-12-04T10:01:24.2961700Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2961913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2963309Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2963373Z graph_break [] 2025-12-04T10:01:24.2963515Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2963585Z Autotune Choices Stats: 2025-12-04T10:01:24.2965186Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2965547Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2965832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2966193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2967534Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2968820Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2970097Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2971381Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.2972676Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.2973948Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2974297Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.2974366Z Autotune Choices Stats: 2025-12-04T10:01:24.2976053Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.2976574Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2976980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2977631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2978968Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2980304Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2981626Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2982966Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2984290Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2985713Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.2987075Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.2988480Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.2989797Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2991125Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.2991406Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.2991555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.2991627Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.2991698Z unimplemented [] 2025-12-04T10:01:24.2991807Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.2992014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.2993405Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.2993553Z graph_break [] 2025-12-04T10:01:24.2993692Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.2993765Z Autotune Choices Stats: 2025-12-04T10:01:24.2995397Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.2995687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.2995932Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.2996336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.2997626Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.2998907Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.3000326Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3001620Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3002908Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3004189Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3004565Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.3004635Z Autotune Choices Stats: 2025-12-04T10:01:24.3006345Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.3006895Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3007264Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3007906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3009250Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3010583Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3011913Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3013239Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3014632Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3015989Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3017348Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3018673Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3020017Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3021342Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3021625Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.3021770Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3021841Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3021907Z unimplemented [] 2025-12-04T10:01:24.3022032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3022241Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3023892Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3024118Z graph_break [] 2025-12-04T10:01:24.3024343Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3024458Z Autotune Choices Stats: 2025-12-04T10:01:24.3026988Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3027574Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3027957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3028485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3030456Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3032182Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3033464Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3034747Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3036034Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3037444Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3037736Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.3037808Z Autotune Choices Stats: 2025-12-04T10:01:24.3039495Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3040019Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3040395Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3041043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3042381Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3043701Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3045028Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3046428Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3047781Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3049138Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3050467Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3051793Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3053120Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3054443Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3054727Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.3054885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3054962Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3055099Z unimplemented [] 2025-12-04T10:01:24.3055454Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3055682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3057168Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3057237Z graph_break [] 2025-12-04T10:01:24.3057388Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3057457Z Autotune Choices Stats: 2025-12-04T10:01:24.3059124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3059423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3059666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3060030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3061337Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3062621Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3063896Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3065171Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3066546Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3067927Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3068255Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.3068328Z Autotune Choices Stats: 2025-12-04T10:01:24.3069970Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3070495Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3070862Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3071509Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3072855Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3074191Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3075522Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3076947Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3078302Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3079628Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3080966Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3082281Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3083600Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3084920Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3085289Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.3085435Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3085507Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3085577Z unimplemented [] 2025-12-04T10:01:24.3085692Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3085919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3087357Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3087425Z graph_break [] 2025-12-04T10:01:24.3087594Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3087671Z Autotune Choices Stats: 2025-12-04T10:01:24.3089264Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3089561Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3089807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3090167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3091457Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3092744Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3094023Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3095364Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3096689Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3097993Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3098283Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.3098350Z Autotune Choices Stats: 2025-12-04T10:01:24.3099991Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.3100517Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3100876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3101521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3102864Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3104205Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3105623Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3106981Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3108369Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3109687Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3111007Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3112330Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3113648Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3114972Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3115321Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.3115463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3115569Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3115635Z unimplemented [] 2025-12-04T10:01:24.3115749Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3115954Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3117405Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3117471Z graph_break [] 2025-12-04T10:01:24.3117607Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3117682Z Autotune Choices Stats: 2025-12-04T10:01:24.3119281Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.3119575Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3119813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3120166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3121457Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3122734Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3124008Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3125392Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3126716Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3128008Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3128295Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.3128366Z Autotune Choices Stats: 2025-12-04T10:01:24.3130017Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.3130544Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3130908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3131554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3132900Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3134305Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3135669Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3137035Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3138363Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3139679Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3141027Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3142356Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3143676Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3145111Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3145394Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.3145545Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3145618Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3145683Z unimplemented [] 2025-12-04T10:01:24.3145831Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3146037Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3147521Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3147586Z graph_break [] 2025-12-04T10:01:24.3147723Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3147801Z Autotune Choices Stats: 2025-12-04T10:01:24.3149405Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.3149695Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3149936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3150297Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3151591Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3152874Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3154276Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3155918Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3157239Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3158538Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3158833Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.3158904Z Autotune Choices Stats: 2025-12-04T10:01:24.3160547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.3161082Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3161448Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3162095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3163439Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3164925Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3166297Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3167621Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3168957Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3170271Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3171598Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3172925Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3174311Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3175663Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3175980Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.3176127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3176199Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3176265Z unimplemented [] 2025-12-04T10:01:24.3176379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3176589Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3177987Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3178056Z graph_break [] 2025-12-04T10:01:24.3178192Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3178276Z Autotune Choices Stats: 2025-12-04T10:01:24.3179872Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.3180170Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3180412Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3180773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3182065Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3183414Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3184730Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3186054Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3187402Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3188696Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3188978Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.3189054Z Autotune Choices Stats: 2025-12-04T10:01:24.3190705Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.3191233Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3191593Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3192342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3193716Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3195090Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3196432Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3197761Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3199094Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3200419Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3201755Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3203153Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3204511Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3205890Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3206175Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.3206322Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3206395Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3206461Z unimplemented [] 2025-12-04T10:01:24.3206591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3206802Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3208221Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3208285Z graph_break [] 2025-12-04T10:01:24.3208424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3208503Z Autotune Choices Stats: 2025-12-04T10:01:24.3210099Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3210397Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3210642Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3211008Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3212374Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3213693Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3215015Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3216306Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3217584Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3218873Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3219157Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.3219234Z Autotune Choices Stats: 2025-12-04T10:01:24.3220883Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.3221411Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3221836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3222494Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3223872Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3225250Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3226600Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3228036Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3229370Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3230707Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3232046Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3233492Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3234855Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3236195Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3236482Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.3236630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3236702Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3236769Z unimplemented [] 2025-12-04T10:01:24.3236883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3237091Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3238497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3238574Z graph_break [] 2025-12-04T10:01:24.3238709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3238784Z Autotune Choices Stats: 2025-12-04T10:01:24.3240408Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3240702Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3241012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3241375Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3242710Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3244034Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3245321Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3246612Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3247889Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3249181Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3249461Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.3249551Z 
Autotune Choices Stats: 2025-12-04T10:01:24.3251194Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.3251785Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3252178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3252835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3254209Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.3255922Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3257373Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3258717Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3260051Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3261382Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3262911Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3264299Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3265627Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3269172Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3269476Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.3269622Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3269703Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3269768Z unimplemented [] 2025-12-04T10:01:24.3269892Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3270116Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3271536Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3271606Z graph_break [] 2025-12-04T10:01:24.3271746Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3271815Z Autotune Choices Stats: 2025-12-04T10:01:24.3273438Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3273791Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3274039Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3274433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3275735Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3277045Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3278325Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3279679Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3280959Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.3282236Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3282521Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.3282593Z Autotune Choices Stats: 2025-12-04T10:01:24.3284281Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.3284832Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3285205Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3285884Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3287332Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3288734Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3290062Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3291382Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3292697Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3294050Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3295698Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.3297071Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3298400Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3299753Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3300043Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.3300200Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3300282Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3300346Z 
unimplemented [] 2025-12-04T10:01:24.3300455Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3300670Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3302054Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3302123Z graph_break [] 2025-12-04T10:01:24.3302261Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3302368Z Autotune Choices Stats: 2025-12-04T10:01:24.3303976Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.3304315Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3304562Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3304919Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3306260Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3307604Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3308926Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3310199Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3311479Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3312757Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3313075Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.3313142Z Autotune Choices Stats: 2025-12-04T10:01:24.3314817Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.3315345Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3315742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3316388Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3317728Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3319080Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.3320399Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3321716Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3323036Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3324425Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3325785Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3327100Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3328450Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3329765Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3330056Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.3330195Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3330265Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3330335Z unimplemented [] 2025-12-04T10:01:24.3330440Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3330652Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3332037Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3332143Z graph_break [] 2025-12-04T10:01:24.3332281Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3332348Z Autotune Choices Stats: 2025-12-04T10:01:24.3333981Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3334272Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3334549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3334903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3336202Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3337512Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3338796Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3340061Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3341337Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3342669Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3342949Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.3343049Z Autotune Choices Stats: 2025-12-04T10:01:24.3344713Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.3345236Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3345599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3346279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3347687Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3349024Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3350345Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3351669Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3353054Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3354404Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3355968Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3357412Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.3358747Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3360071Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3360363Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.3360505Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3360575Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3360646Z unimplemented [] 2025-12-04T10:01:24.3360756Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3360972Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3362422Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3362485Z graph_break [] 2025-12-04T10:01:24.3362624Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3362736Z Autotune Choices Stats: 2025-12-04T10:01:24.3364381Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3364684Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3364936Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3365291Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3366626Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3367901Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3369179Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3370452Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3371728Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3373073Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3373369Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.3373437Z Autotune Choices Stats: 2025-12-04T10:01:24.3375113Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.3375633Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3376061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3376699Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3378037Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3379359Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3380684Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3382041Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3383395Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3384753Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3386070Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3387645Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3388975Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3390296Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3390587Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.3390790Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.3390860Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3390933Z unimplemented [] 2025-12-04T10:01:24.3391041Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3391252Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3392680Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3392746Z graph_break [] 2025-12-04T10:01:24.3392889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3392956Z Autotune Choices Stats: 2025-12-04T10:01:24.3394590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3394880Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3395166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3395518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3396818Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3398088Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3399359Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3400649Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3402029Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3403353Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3403650Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.3403718Z Autotune Choices Stats: 2025-12-04T10:01:24.3405371Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.3405939Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3406307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3406944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3408286Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3409614Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3410974Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3412329Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3413691Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3415013Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3416383Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3417701Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3419027Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3420347Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3420674Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.3420814Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3420885Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3420956Z unimplemented [] 2025-12-04T10:01:24.3421061Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3421302Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3422729Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3422801Z graph_break [] 2025-12-04T10:01:24.3422943Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3423011Z Autotune Choices Stats: 2025-12-04T10:01:24.3424608Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.3424933Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3425184Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3425539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3426833Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3428170Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.3429446Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3430763Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3432068Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3433373Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3433657Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.3433776Z Autotune Choices Stats: 2025-12-04T10:01:24.3435412Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.3435938Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3436305Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3436948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3438311Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3439640Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3441035Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3442395Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3443720Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3445082Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3446399Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3447721Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3449044Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3450400Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3450746Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.3450886Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3450958Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3451028Z unimplemented [] 2025-12-04T10:01:24.3451135Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3451339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3452775Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3452842Z graph_break [] 2025-12-04T10:01:24.3452984Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3453087Z Autotune Choices Stats: 2025-12-04T10:01:24.3454689Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3454978Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3455521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3455898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3457193Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3458480Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3459855Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3461177Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3462518Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3463795Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3464137Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:24.3464207Z Autotune Choices Stats: 2025-12-04T10:01:24.3465861Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.3466378Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3466748Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3467449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3468784Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3470194Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3471547Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3472873Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3474191Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3475547Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3476866Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3478190Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3479536Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3480889Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3481181Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:24.3481390Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:24.3481505Z Traceback (most recent call last): 2025-12-04T10:01:24.3481860Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:24.3481925Z self.assertTrue( 2025-12-04T10:01:24.3482155Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:24.3482242Z raise self.failureException(msg) 2025-12-04T10:01:24.3482521Z AssertionError: False is not true : Log file /tmp/tmpd6qqrq76/flex_attention_configs.json was not created 2025-12-04T10:01:24.3482526Z 2025-12-04T10:01:24.3482711Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:24.3483006Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:24.3483012Z 2025-12-04T10:01:24.3483190Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:24.3483341Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3483414Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3483486Z unimplemented [] 2025-12-04T10:01:24.3483596Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3484999Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:24.3485222Z 
aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3485283Z graph_break [] 2025-12-04T10:01:24.3485426Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3486589Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.3486672Z current_size = base.storage().size() 2025-12-04T10:01:24.3486749Z Autotune Choices Stats: 2025-12-04T10:01:24.3488346Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.3488708Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3488981Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3489342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3490654Z triton_flex_attention_4 0.0094 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3491923Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3493225Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3494488Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.3495771Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3497043Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3497370Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.3497439Z Autotune Choices Stats: 2025-12-04T10:01:24.3499115Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.3499639Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3500037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3500771Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3502158Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3503533Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3504850Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3511657Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3513008Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3514459Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:24.3515822Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3517133Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3518451Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3519793Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3520084Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.3520241Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3520317Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3520382Z unimplemented [] 2025-12-04T10:01:24.3520502Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3520713Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3522103Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3522213Z graph_break [] 2025-12-04T10:01:24.3522352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3522430Z Autotune Choices Stats: 2025-12-04T10:01:24.3524065Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3524366Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3524606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3525003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3526306Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3527568Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3528888Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3530159Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3531422Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3532693Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3533009Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.3533083Z Autotune Choices Stats: 2025-12-04T10:01:24.3534789Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3535365Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3535734Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3536374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3537744Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3539070Z 
triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3540380Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3541693Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3543039Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3544378Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3545723Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3547026Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3548496Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3549809Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3550095Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.3550236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3550313Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3550377Z unimplemented [] 2025-12-04T10:01:24.3550492Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3550698Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3552081Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3552196Z 
graph_break [] 2025-12-04T10:01:24.3552334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3552407Z Autotune Choices Stats: 2025-12-04T10:01:24.3554036Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3554364Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3554611Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3554964Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3556522Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3557920Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3559197Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3560481Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3561747Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3563079Z triton_flex_attention_39 0.0133 
ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3563413Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:24.3563489Z Autotune Choices Stats: 2025-12-04T10:01:24.3565201Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3565734Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3566093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3566776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3568117Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3569439Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3570760Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3572078Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3573459Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3574807Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3576133Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3577475Z triton_flex_attention_backward_51 0.0133 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3578795Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3580109Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3580397Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.3580540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3580625Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3580692Z unimplemented [] 2025-12-04T10:01:24.3580842Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T10:01:24.3581050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3582440Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3582510Z graph_break [] 2025-12-04T10:01:24.3582683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3582761Z Autotune Choices Stats: 2025-12-04T10:01:24.3584379Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3584676Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3584919Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3585274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:24.3586598Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3587946Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3589218Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3590492Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3591800Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3593109Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3593402Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.3593479Z Autotune Choices Stats: 2025-12-04T10:01:24.3595149Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:24.3595715Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3596080Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3596719Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3598048Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3599375Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3600686Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3602105Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3603457Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3604769Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3606126Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3607446Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3608763Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3610080Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3610399Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.3610540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3610617Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3610680Z unimplemented [] 2025-12-04T10:01:24.3610794Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3610999Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3612434Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3612507Z graph_break [] 2025-12-04T10:01:24.3612642Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3612725Z Autotune Choices Stats: 2025-12-04T10:01:24.3614348Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3614671Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3614912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3615264Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3616559Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3617827Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3619100Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3620376Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3621708Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3623010Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3623291Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.3623365Z Autotune Choices Stats: 
2025-12-04T10:01:24.3625006Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3625564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3625928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3626559Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3627934Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.3629254Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3630611Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3631962Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3633307Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3634617Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3635973Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3637284Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3638594Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3639915Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3640251Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.3640387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3640465Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3640527Z unimplemented [] 2025-12-04T10:01:24.3640667Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3640880Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3642302Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3642376Z graph_break [] 2025-12-04T10:01:24.3642513Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3642586Z Autotune Choices Stats: 2025-12-04T10:01:24.3644175Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3644504Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3644744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3645093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3646379Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3647647Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3648917Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3650268Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3651578Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.3652856Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3653134Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.3653245Z Autotune Choices Stats: 2025-12-04T10:01:24.3654881Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3655638Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3656018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3656665Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3658022Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3659355Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3660863Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3662250Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3663581Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3664945Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3666267Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.3667652Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3668975Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3670335Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3670663Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.3670810Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3670890Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3670957Z unimplemented [] 
2025-12-04T10:01:24.3671066Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3671313Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3672706Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3672773Z graph_break [] 2025-12-04T10:01:24.3672906Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3673028Z Autotune Choices Stats: 2025-12-04T10:01:24.3674632Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3674926Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3675166Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3675520Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3676831Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3678095Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3679406Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3680731Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3682036Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3683309Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3683623Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.3683697Z Autotune Choices Stats: 2025-12-04T10:01:24.3685336Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.3685861Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3686221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3686862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3688187Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3689572Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3690915Z 
triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3692237Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3693580Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3694902Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3696225Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3697541Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3698893Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3700231Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3700533Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.3700702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3700781Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3700845Z unimplemented [] 2025-12-04T10:01:24.3700951Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3701162Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3702544Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3702646Z graph_break [] 2025-12-04T10:01:24.3702782Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3702852Z Autotune Choices Stats: 2025-12-04T10:01:24.3704450Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.3704736Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3704980Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3705330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3706623Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3707984Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3709298Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3710628Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3711896Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3713221Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3713503Z SingleProcess AUTOTUNE 
benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.3713579Z Autotune Choices Stats: 2025-12-04T10:01:24.3715210Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3715740Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3716098Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3716738Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3718107Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3719462Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3720809Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3722127Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3723480Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3724796Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3726114Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3727424Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3728811Z 
triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3730153Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3730453Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.3730592Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3730670Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3730734Z unimplemented [] 2025-12-04T10:01:24.3730839Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3731085Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3732473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3732542Z graph_break [] 2025-12-04T10:01:24.3732672Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3732743Z Autotune Choices Stats: 2025-12-04T10:01:24.3734345Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.3734627Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3734880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3735236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3736526Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:24.3737865Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3739170Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3740452Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3741751Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3743032Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3743306Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.3743379Z Autotune Choices Stats: 2025-12-04T10:01:24.3745013Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3745531Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3745927Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1] 2025-12-04T10:01:24.3746567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3748032Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3749401Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3750712Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3752076Z triton_flex_attention_backward_161 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3753385Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3754705Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3756193Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3757662Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3759034Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3760348Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3760685Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.3760897Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:24.3761019Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3761085Z unimplemented [] 2025-12-04T10:01:24.3761192Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3761404Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3762798Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3762866Z graph_break [] 2025-12-04T10:01:24.3763005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3763075Z Autotune Choices Stats: 2025-12-04T10:01:24.3764670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3764955Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3765199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3765617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3766951Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3768215Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3769529Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3770799Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3772110Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3773385Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3773665Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.3773738Z Autotune Choices Stats: 2025-12-04T10:01:24.3775368Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3775932Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3776299Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3776963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3778323Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3779646Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3780995Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3782382Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3783700Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3785024Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3786400Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3787832Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3789188Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3790504Z triton_flex_attention_backward_189 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3790829Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:24.3790965Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3791038Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3791101Z unimplemented [] 2025-12-04T10:01:24.3791206Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3791414Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3792811Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3792881Z graph_break [] 2025-12-04T10:01:24.3793015Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3793090Z Autotune Choices Stats: 2025-12-04T10:01:24.3794670Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.3794993Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3795237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3795587Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3796912Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3798214Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3799501Z triton_flex_attention_193 
0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3800806Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3802082Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3803368Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3803647Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.3803714Z Autotune Choices Stats: 2025-12-04T10:01:24.3805360Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3806146Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3806511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3807176Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3808512Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3809848Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3811200Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3812520Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3813835Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3815183Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3816534Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3817875Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3819200Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3820559Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3820843Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.3820978Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3821054Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3821117Z unimplemented [] 2025-12-04T10:01:24.3821233Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3821441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3822828Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3822893Z graph_break [] 2025-12-04T10:01:24.3823027Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3823094Z Autotune Choices Stats: 2025-12-04T10:01:24.3824694Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3825018Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3825294Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3825647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3826975Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3828301Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3829612Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3830896Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3832169Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3833441Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3833761Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.3833827Z Autotune Choices Stats: 2025-12-04T10:01:24.3835503Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3836033Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3836442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3837075Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3838403Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3839758Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3841075Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3842399Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3843712Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3845101Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3846459Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3847769Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3849114Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3850425Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3850717Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds 
precompiling for 13 choices 2025-12-04T10:01:24.3850850Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3850925Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3850987Z unimplemented [] 2025-12-04T10:01:24.3851091Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3851304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3852691Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3852797Z graph_break [] 2025-12-04T10:01:24.3852939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3853008Z Autotune Choices Stats: 2025-12-04T10:01:24.3854634Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.3854918Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:24.3855167Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3855887Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3857214Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3858546Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3859822Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3861097Z triton_flex_attention_230 0.0113 ms 81.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3862383Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3863659Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3863989Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.3864058Z Autotune Choices Stats: 2025-12-04T10:01:24.3865795Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.3866327Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3866700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3867408Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3868800Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3870130Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3871453Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3872782Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3874128Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3875484Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3876844Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3878158Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3879515Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3880843Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3881136Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.3881276Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3881353Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3881423Z unimplemented [] 2025-12-04T10:01:24.3881528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3881737Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3883126Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3883232Z graph_break [] 2025-12-04T10:01:24.3883367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3883434Z Autotune Choices Stats: 2025-12-04T10:01:24.3885111Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3885399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3885647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3886001Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3887290Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3888600Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3889881Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3891154Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3892423Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3893794Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3894073Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.3894152Z Autotune Choices Stats: 2025-12-04T10:01:24.3895823Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3896344Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3896743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3897382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3898714Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3900035Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3901363Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3902722Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3904066Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3905586Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3906926Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3908359Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3909676Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3910996Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3911284Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.3911425Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3911538Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3911607Z unimplemented [] 2025-12-04T10:01:24.3911712Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3911922Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3913340Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3913409Z graph_break [] 2025-12-04T10:01:24.3913544Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3913612Z Autotune Choices Stats: 2025-12-04T10:01:24.3915243Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3915536Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3915780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3916174Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3917465Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3918736Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3920025Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3921293Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3922607Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3923926Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3924241Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.3924309Z Autotune Choices Stats: 2025-12-04T10:01:24.3925959Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.3926517Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3926887Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3927516Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3928862Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3930192Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3931525Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3932944Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3934292Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3935617Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3936981Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3938306Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3939633Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3940947Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3941271Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.3941408Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3941487Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3941557Z unimplemented [] 2025-12-04T10:01:24.3941664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3941876Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3943296Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3943361Z graph_break [] 2025-12-04T10:01:24.3943530Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.3943598Z Autotune Choices Stats: 2025-12-04T10:01:24.3945197Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:24.3945529Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3945784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3946138Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3947482Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3948755Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3950033Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3951348Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3952656Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3953961Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3954246Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.3954326Z Autotune Choices Stats: 2025-12-04T10:01:24.3956262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3956975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3957426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3958164Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3959503Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3960823Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3962244Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3963607Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3964921Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3966299Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3967615Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3968940Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3970267Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3971621Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.3971905Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.3972072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.3972144Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.3972223Z unimplemented [] 2025-12-04T10:01:24.3972328Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.3972535Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.3973957Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.3974020Z graph_break [] 2025-12-04T10:01:24.3974161Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:24.3974228Z Autotune Choices Stats: 2025-12-04T10:01:24.3975817Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.3976134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3976378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3976731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.3978015Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3979289Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3980572Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.3981912Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.3983223Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3984499Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3984814Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.3984880Z Autotune Choices Stats: 2025-12-04T10:01:24.3986517Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.3987031Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.3987455Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.3988088Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:24.3989470Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3990843Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3992214Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3993588Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.3994902Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.3996263Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.3997582Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.3998911Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4000227Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4001612Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4001895Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.4002033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4002104Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4002173Z unimplemented [] 2025-12-04T10:01:24.4002330Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4002534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4003922Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4004018Z graph_break [] 2025-12-04T10:01:24.4004158Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4004225Z Autotune Choices Stats: 2025-12-04T10:01:24.4005919Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4006263Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4006513Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4006870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4008166Z triton_flex_attention_327 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4009439Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4010801Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4012106Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.4013387Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4014688Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4014969Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.4015037Z Autotune Choices Stats: 2025-12-04T10:01:24.4016685Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4017203Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4017573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4018210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4019547Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4020933Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4022291Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4023621Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4024976Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4026291Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:24.4027683Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4029011Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4030407Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4031748Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4032032Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.4032166Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4032236Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4032304Z unimplemented [] 2025-12-04T10:01:24.4032408Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4032613Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4034007Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4034117Z graph_break [] 2025-12-04T10:01:24.4034259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4034325Z Autotune Choices Stats: 2025-12-04T10:01:24.4036010Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4036351Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4036643Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4037059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4038441Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4039748Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4041070Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4042362Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4043655Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4044960Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4045241Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.4045309Z Autotune Choices Stats: 
2025-12-04T10:01:24.4046951Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.4047471Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4047835Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4048506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4049878Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.4051229Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4052552Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4053914Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4055415Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4056747Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4058070Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4059457Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4060819Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4062203Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4062494Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.4062633Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4062759Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4062829Z unimplemented [] 2025-12-04T10:01:24.4062937Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4063144Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4064536Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4064598Z graph_break [] 2025-12-04T10:01:24.4064740Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4064810Z Autotune Choices Stats: 2025-12-04T10:01:24.4066413Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4066699Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4066951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4067387Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4068718Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4070032Z triton_flex_attention_366 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4071349Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4072621Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4073929Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:24.4075206Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4075497Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.4075566Z Autotune Choices Stats: 2025-12-04T10:01:24.4077214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4077734Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4078158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4078809Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4080190Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4081542Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4082873Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4084234Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4085554Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4086884Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4088203Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.4089594Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4090945Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4092262Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4092578Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.4092717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4092789Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4092857Z 
unimplemented [] 2025-12-04T10:01:24.4092963Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4093166Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4094565Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4094629Z graph_break [] 2025-12-04T10:01:24.4094768Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4094837Z Autotune Choices Stats: 2025-12-04T10:01:24.4096432Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4096715Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4096999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4097354Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4098674Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4099981Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4101266Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4102565Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4103843Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4105122Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4105407Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.4105473Z Autotune Choices Stats: 2025-12-04T10:01:24.4107126Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.4107739Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4108142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4108780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4110151Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4111473Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.4112852Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4114172Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4115502Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4116827Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4118202Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4119553Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4120866Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4122219Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4122498Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.4122631Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4122701Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4122768Z unimplemented [] 2025-12-04T10:01:24.4122873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4123077Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4124466Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4124527Z graph_break [] 2025-12-04T10:01:24.4124664Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4124732Z Autotune Choices Stats: 2025-12-04T10:01:24.4126338Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4126660Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4126897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4127285Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4128602Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4129875Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4131153Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4132462Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4133750Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4135029Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4135315Z 
SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.4135419Z Autotune Choices Stats: 2025-12-04T10:01:24.4137064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.4137614Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4137984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4138651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4139990Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4141345Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4142668Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4143986Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4145303Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4146671Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4148070Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4149446Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.4150767Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4152123Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4152402Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.4152542Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4152614Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4152683Z unimplemented [] 2025-12-04T10:01:24.4152790Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4152993Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4154391Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4154455Z graph_break [] 2025-12-04T10:01:24.4154593Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4154702Z Autotune Choices Stats: 2025-12-04T10:01:24.4156562Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4156939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4157186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4157550Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4158885Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4160171Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4161503Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4162779Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4164076Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4165352Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4165695Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.4165764Z Autotune Choices Stats: 2025-12-04T10:01:24.4167453Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.4168013Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4168380Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4169015Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4170388Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4171717Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4173057Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4174379Z 
triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4175735Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4177095Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4178441Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4179766Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4181118Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4182440Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4182717Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.4182860Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.4182932Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4182999Z unimplemented [] 2025-12-04T10:01:24.4183103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4183305Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4184706Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4184808Z graph_break [] 2025-12-04T10:01:24.4184947Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4185013Z Autotune Choices Stats: 2025-12-04T10:01:24.4186669Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4186963Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4187326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4187688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4188974Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4190294Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4191567Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4192851Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4194126Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4195438Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4195761Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.4195830Z Autotune Choices Stats: 2025-12-04T10:01:24.4197501Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4198022Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4198386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4199052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4200394Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4201719Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4203043Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4204356Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4205743Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4207092Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4208408Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4209760Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4211082Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4212416Z 
triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4212696Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.4212839Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4212907Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4212978Z unimplemented [] 2025-12-04T10:01:24.4213083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4213321Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4214717Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4214778Z graph_break [] 2025-12-04T10:01:24.4214953Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4215023Z Autotune Choices Stats: 2025-12-04T10:01:24.4216641Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4216939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4217179Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4217534Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4218863Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4220138Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.4221410Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4222696Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4224024Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4225335Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4225621Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.4225693Z Autotune Choices Stats: 2025-12-04T10:01:24.4227435Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4227987Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4228360Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4228994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4230336Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4231661Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4232986Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4234378Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4235740Z 
triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4237077Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4238395Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4239763Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4241074Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4242398Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4242680Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.4242857Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4242928Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4243000Z unimplemented [] 2025-12-04T10:01:24.4243111Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4243310Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4244731Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4244799Z graph_break [] 2025-12-04T10:01:24.4244939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4245004Z Autotune Choices Stats: 2025-12-04T10:01:24.4246623Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4246916Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4247186Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4247541Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4248824Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4250111Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4251383Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4252656Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4253996Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4255519Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4255826Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.4255937Z Autotune Choices Stats: 2025-12-04T10:01:24.4257692Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.4258316Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4258685Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4259325Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4260666Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4261983Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4263367Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4264725Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4266095Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4267470Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4268830Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4270153Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4271475Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4272793Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4273109Z SingleProcess AUTOTUNE 
benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.4273251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4273319Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4273383Z unimplemented [] 2025-12-04T10:01:24.4273529Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4273732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4275155Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4275220Z graph_break [] 2025-12-04T10:01:24.4275357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4275425Z Autotune Choices Stats: 2025-12-04T10:01:24.4277018Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.4277345Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4277582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4277936Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4279233Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4280516Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4281789Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4283134Z 
triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4284412Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4285719Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4286003Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.4286111Z Autotune Choices Stats: 2025-12-04T10:01:24.4287752Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4288273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4288636Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4289276Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4290638Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4291958Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4293354Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4294728Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4296052Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4297412Z triton_flex_attention_backward_504 0.0153 
ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4298732Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4300064Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4301379Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4302741Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4303053Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.4303198Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4303269Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4303334Z unimplemented [] 2025-12-04T10:01:24.4303448Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4303650Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4305079Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4305144Z graph_break [] 2025-12-04T10:01:24.4305282Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4305383Z Autotune Choices Stats: 2025-12-04T10:01:24.4306981Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.4307335Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4307580Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4307944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4309230Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4310508Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4311830Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4313431Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4314769Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4316124Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4316594Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.4316667Z Autotune Choices Stats: 2025-12-04T10:01:24.4318309Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.4318836Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4319198Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4319841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4321175Z triton_flex_attention_backward_520 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4322596Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4323959Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4325278Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4326641Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4327964Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4329288Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4330610Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4331965Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4333342Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4333633Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.4333807Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4333878Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4333940Z unimplemented [] 2025-12-04T10:01:24.4334050Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4334256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4335648Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4335746Z graph_break [] 2025-12-04T10:01:24.4335878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4335951Z Autotune Choices Stats: 2025-12-04T10:01:24.4337539Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.4337832Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4338071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4338425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4339712Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4341021Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4342321Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4343631Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4344913Z triton_flex_attention_535 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4346361Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4346704Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.4346786Z Autotune Choices Stats: 2025-12-04T10:01:24.4348657Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.4349183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4349545Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4350188Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4351568Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4352935Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4354303Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4355830Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4357491Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4358814Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4360140Z triton_flex_attention_backward_544 0.0155 ms 86.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4361469Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4362875Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4364241Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4364524Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.4364666Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4364737Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4364799Z unimplemented [] 2025-12-04T10:01:24.4364908Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4365151Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4366543Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4366605Z graph_break [] 2025-12-04T10:01:24.4366741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4366818Z Autotune Choices Stats: 2025-12-04T10:01:24.4368414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4368709Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4368948Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4369307Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4370592Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4371953Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4373274Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4374557Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4375878Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4377159Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4377443Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.4377512Z Autotune Choices Stats: 2025-12-04T10:01:24.4379155Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.4379678Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4380075Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4380717Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4382084Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4383445Z triton_flex_attention_backward_558 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4384784Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4386350Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4388418Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4390461Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4392508Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4394577Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4396587Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4398637Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4399115Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.4399343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4399454Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4399554Z unimplemented [] 2025-12-04T10:01:24.4399728Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4400050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4402215Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4402319Z graph_break [] 2025-12-04T10:01:24.4402534Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4402654Z Autotune Choices Stats: 2025-12-04T10:01:24.4405133Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4405581Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4405969Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4406595Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4408609Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4410578Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4412628Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4414640Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4416723Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4418722Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4419166Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.4419274Z Autotune Choices Stats: 2025-12-04T10:01:24.4421373Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.4421964Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4422329Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4423005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4424397Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4425739Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4427098Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4428491Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4429819Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4431132Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4432487Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4433850Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4435206Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4436528Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4436853Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.4437003Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4437077Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4437141Z unimplemented [] 2025-12-04T10:01:24.4437255Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4437466Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4438873Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4438938Z graph_break [] 2025-12-04T10:01:24.4439074Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4439151Z Autotune Choices Stats: 2025-12-04T10:01:24.4440746Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4441076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4441317Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4441681Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4443005Z 
triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4444323Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4445601Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4446909Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4448187Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4449478Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4449759Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.4449834Z Autotune Choices Stats: 2025-12-04T10:01:24.4451469Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.4452040Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4452435Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4453082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4454497Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4456203Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4457667Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4459010Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4460339Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4461661Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4463088Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4464464Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4465786Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4467148Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4467501Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.4467657Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4467731Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4467796Z unimplemented [] 2025-12-04T10:01:24.4467911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4468119Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4469521Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4469583Z graph_break [] 2025-12-04T10:01:24.4469724Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4469803Z Autotune Choices Stats: 2025-12-04T10:01:24.4471403Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.4471737Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4472013Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4472390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4473715Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4475002Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4476326Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4477610Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4478889Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4480168Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4480486Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 
2025-12-04T10:01:24.4480561Z Autotune Choices Stats: 2025-12-04T10:01:24.4482236Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4482762Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4483159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4483807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4485144Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4486503Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4487821Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4489146Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4490470Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4491875Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4493243Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4494564Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4495941Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4497270Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4497552Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.4497696Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4497771Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4497835Z unimplemented [] 2025-12-04T10:01:24.4497947Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4498152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4499555Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4499657Z graph_break [] 2025-12-04T10:01:24.4499797Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4499871Z Autotune Choices Stats: 2025-12-04T10:01:24.4501527Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4501823Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4502069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4502464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4503770Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4505098Z triton_flex_attention_632 0.0113 
ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4506389Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4507757Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4509045Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4510345Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4510666Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.4510741Z Autotune Choices Stats: 2025-12-04T10:01:24.4512428Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.4512990Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4513357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4514005Z dtypes: torch.float16, 
torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4515379Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4516714Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4518039Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4519365Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4520738Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4522097Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4523464Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4524795Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4526166Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4527503Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4527793Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.4527936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4528009Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:24.4528072Z unimplemented [] 2025-12-04T10:01:24.4528191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4528400Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4529797Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4529935Z graph_break [] 2025-12-04T10:01:24.4530069Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4530143Z Autotune Choices Stats: 2025-12-04T10:01:24.4531777Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4532110Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4532354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4532714Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4534011Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4535338Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4536615Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4537898Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4539177Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4540521Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4540801Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.4540874Z Autotune Choices Stats: 2025-12-04T10:01:24.4542562Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4543094Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4543508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4544157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4545506Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4546862Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4548235Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4549606Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4550962Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4552318Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4553649Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4555008Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4556902Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4558252Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4558538Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.4558682Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4558854Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4558919Z unimplemented [] 2025-12-04T10:01:24.4559032Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4559242Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4560687Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4560758Z graph_break [] 2025-12-04T10:01:24.4560897Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4560969Z Autotune Choices Stats: 2025-12-04T10:01:24.4562638Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4562939Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4563178Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4563603Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4564901Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4566197Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4567480Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, 
BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4568770Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4570088Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4571409Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.4571726Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.4571802Z Autotune Choices Stats: 2025-12-04T10:01:24.4573583Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4574136Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4574500Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4575156Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4576501Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4577847Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4579174Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4580567Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4581937Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4583259Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4584620Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4585951Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.4587335Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4588661Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4588985Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.4589125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4589203Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4589265Z unimplemented [] 2025-12-04T10:01:24.4589379Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4589585Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4591008Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4591082Z graph_break [] 2025-12-04T10:01:24.4591247Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4591322Z Autotune Choices Stats: 2025-12-04T10:01:24.4592913Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.4593243Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4593489Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4593843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4595135Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4596436Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4597713Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4599045Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4600352Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4601706Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4601986Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.4602063Z Autotune Choices Stats: 2025-12-04T10:01:24.4603715Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.4604275Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4604641Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 
1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4605293Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4606630Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4607974Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4609371Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.4610735Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4612076Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4613396Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4614760Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4616088Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4617419Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4618781Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4619060Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.4619228Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4619303Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4619365Z unimplemented [] 2025-12-04T10:01:24.4619477Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4619685Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4621106Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4621173Z graph_break [] 2025-12-04T10:01:24.4621304Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4621378Z Autotune Choices Stats: 2025-12-04T10:01:24.4622988Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4623319Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4623556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4623919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4625219Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4626507Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4627851Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4629207Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4630516Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4631808Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4632119Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:24.4632193Z Autotune Choices Stats: 2025-12-04T10:01:24.4633838Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.4634358Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4634722Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4635373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4636713Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4638089Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4639489Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4640873Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4642202Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4643565Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4644889Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4646210Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4647534Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4648932Z 
triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4649215Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:24.4649350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4649427Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4649491Z unimplemented [] 2025-12-04T10:01:24.4649629Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4649842Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4651231Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4651333Z graph_break [] 2025-12-04T10:01:24.4651468Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4651541Z Autotune Choices Stats: 2025-12-04T10:01:24.4653141Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.4653424Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4653663Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4654020Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4655585Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4656984Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.4658398Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4659725Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4661015Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4662303Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4668491Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:24.4668612Z Autotune Choices Stats: 2025-12-04T10:01:24.4670307Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4670845Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4671213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4671855Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4673212Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4674659Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4676010Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4677339Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4678690Z 
triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4680004Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4681330Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4682639Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4683999Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4685342Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4685687Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:24.4685892Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:24.4685989Z Traceback (most recent call last): 2025-12-04T10:01:24.4686349Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:24.4686421Z self.assertTrue( 2025-12-04T10:01:24.4686657Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:24.4686778Z raise self.failureException(msg) 2025-12-04T10:01:24.4687059Z AssertionError: False is not true : Log 
file /tmp/tmp80jrgwb0/flex_attention_configs.json was not created 2025-12-04T10:01:24.4687066Z 2025-12-04T10:01:24.4687209Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:24.4687498Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:24.4687502Z 2025-12-04T10:01:24.4687689Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:24.4687832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4687911Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4687978Z unimplemented [] 2025-12-04T10:01:24.4688093Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4689511Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:24.4689722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4689790Z graph_break [] 2025-12-04T10:01:24.4689927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4691092Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.4691184Z current_size = base.storage().size() 2025-12-04T10:01:24.4691297Z Autotune Choices Stats: 2025-12-04T10:01:24.4692894Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.4693223Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4693467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4693822Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4695153Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4696413Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4697718Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4698975Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4700237Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4701503Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4701849Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.4701920Z Autotune Choices Stats: 2025-12-04T10:01:24.4703593Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.4704116Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4704517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4705157Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4706484Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4707897Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4709203Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4710513Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4711820Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4713205Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4714545Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4715877Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4717221Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4718520Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4718812Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.4718958Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4719031Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4719102Z unimplemented [] 2025-12-04T10:01:24.4719211Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4719425Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4720816Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4720921Z graph_break [] 2025-12-04T10:01:24.4721065Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4721134Z Autotune Choices Stats: 2025-12-04T10:01:24.4722760Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4723049Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4723331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4723695Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4724979Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4726276Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4727545Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4728807Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4730068Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4731374Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4731655Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.4731758Z Autotune Choices Stats: 2025-12-04T10:01:24.4733421Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4733946Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4734315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4735003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4736337Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4737646Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4738968Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4740280Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4741652Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4742997Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4744306Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4745656Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4746972Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4748358Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4748647Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.4748785Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4748858Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4748929Z unimplemented [] 2025-12-04T10:01:24.4749039Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4749246Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4750698Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4750763Z graph_break [] 2025-12-04T10:01:24.4750911Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4751015Z Autotune Choices Stats: 2025-12-04T10:01:24.4752635Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4752930Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4753174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4753529Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4754834Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4756377Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4757668Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4758945Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4760218Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4761615Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4761907Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:24.4761979Z Autotune Choices Stats: 2025-12-04T10:01:24.4763670Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4764194Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4764609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4765242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4766592Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4767903Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4769212Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4770563Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4771897Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4773258Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4774561Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4775908Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4777226Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4778531Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4778820Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.4779000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4779071Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4779142Z unimplemented [] 2025-12-04T10:01:24.4779256Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4779465Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4780908Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4780975Z graph_break [] 2025-12-04T10:01:24.4781120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4781190Z Autotune Choices Stats: 2025-12-04T10:01:24.4782809Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4783099Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4783377Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4783731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4785008Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4786274Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4787606Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4788904Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4790267Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.4791586Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4791875Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.4791948Z Autotune Choices Stats: 2025-12-04T10:01:24.4793603Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4794166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4794533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4795171Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4796512Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4797830Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4799185Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4800543Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4801885Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4803202Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4804545Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:24.4805880Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4807192Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4808520Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4808852Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.4808991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4809063Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4809135Z unimplemented [] 
2025-12-04T10:01:24.4809242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4809502Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4810937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4811006Z graph_break [] 2025-12-04T10:01:24.4811147Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4811216Z Autotune Choices Stats: 2025-12-04T10:01:24.4812825Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4813151Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4813398Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4813756Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4815048Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4816336Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4817604Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4818907Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4820216Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4821536Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4821829Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.4821896Z Autotune Choices Stats: 2025-12-04T10:01:24.4823572Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4824091Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4824458Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4825095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4826435Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4827832Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4829233Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4830582Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4831903Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4833252Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4834572Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4835904Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4837215Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4838567Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4838850Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.4839236Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4839313Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4839387Z unimplemented [] 2025-12-04T10:01:24.4839496Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4839704Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4841144Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4841210Z graph_break [] 2025-12-04T10:01:24.4841351Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4841419Z Autotune Choices Stats: 2025-12-04T10:01:24.4843071Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4843357Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4843599Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4843954Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4845241Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4846529Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4847863Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4849168Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4850482Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4851756Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4852073Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.4852141Z Autotune Choices Stats: 2025-12-04T10:01:24.4853772Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4854295Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4854661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4855506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4856859Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4858270Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4859639Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4861012Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4862335Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4863708Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4865025Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4866368Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4867740Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4869137Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4869420Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.4869564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4869636Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4869741Z unimplemented [] 2025-12-04T10:01:24.4869852Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4870057Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4871453Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4871565Z graph_break [] 2025-12-04T10:01:24.4871708Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4871777Z Autotune Choices Stats: 2025-12-04T10:01:24.4873374Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4873659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4873900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4874258Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4875538Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.4876821Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4878166Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4879476Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4880748Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4882075Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4882360Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.4882429Z Autotune Choices Stats: 2025-12-04T10:01:24.4884064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.4884580Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4884944Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:24.4885585Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4886977Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4888347Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4889708Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4891031Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4892392Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4893735Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4895047Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4896367Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4897752Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4899101Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4899383Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.4899524Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:24.4899593Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4899664Z unimplemented [] 2025-12-04T10:01:24.4899781Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4899985Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4901418Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4901482Z graph_break [] 2025-12-04T10:01:24.4901623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4901690Z Autotune Choices Stats: 2025-12-04T10:01:24.4903276Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.4903566Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4903802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4904160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4905444Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4906785Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4908114Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4909435Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4910709Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4912021Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4912302Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.4912371Z Autotune Choices Stats: 2025-12-04T10:01:24.4914013Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4914531Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4914895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4915572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4916950Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4918324Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4919647Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4921029Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4922355Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4923674Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4924987Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4926358Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4927704Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4929056Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4929341Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.4929517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4929587Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4929658Z unimplemented [] 2025-12-04T10:01:24.4929765Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4929981Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4931378Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4931442Z graph_break [] 2025-12-04T10:01:24.4931581Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4931651Z Autotune Choices Stats: 2025-12-04T10:01:24.4933245Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.4933546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4933786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4934148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4935477Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4936793Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4938103Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4939382Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4940702Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4941971Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4942264Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.4942332Z Autotune Choices Stats: 2025-12-04T10:01:24.4943972Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4944530Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4944897Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4945571Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4946944Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4948339Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4949653Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4951006Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4952336Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4953653Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4955001Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4956640Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4958015Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4959336Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4959678Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.4959825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4959898Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4959972Z unimplemented [] 2025-12-04T10:01:24.4960081Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4960285Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4961679Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4961745Z graph_break [] 2025-12-04T10:01:24.4961889Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4961960Z Autotune Choices Stats: 2025-12-04T10:01:24.4963561Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.4963915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4964153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4964510Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4965838Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4967163Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4968442Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.4969747Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4971022Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4972295Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4972578Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.4972645Z Autotune Choices Stats: 2025-12-04T10:01:24.4974291Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.4974851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4975250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4975895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4977257Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4978570Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4979927Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4981244Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4982563Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.4983881Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4985268Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.4986632Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.4987995Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4989358Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.4989638Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:24.4989779Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.4989850Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.4989914Z unimplemented [] 2025-12-04T10:01:24.4990027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.4990231Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.4991633Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.4991705Z graph_break [] 2025-12-04T10:01:24.4991845Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.4991915Z Autotune Choices Stats: 2025-12-04T10:01:24.4993510Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.4993862Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.4994104Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.4994496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.4995820Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4997103Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.4998406Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.4999690Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5000966Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5002249Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5002534Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.5002640Z Autotune Choices Stats: 2025-12-04T10:01:24.5004303Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5004821Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5005192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5005860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5007189Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5008554Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5009883Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5011199Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5012520Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5013874Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5015216Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5016566Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5017878Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:24.5019227Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5019505Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.5019651Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5019722Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5019787Z unimplemented [] 2025-12-04T10:01:24.5019904Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5020108Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5021497Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5021564Z graph_break [] 2025-12-04T10:01:24.5021698Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5021822Z Autotune Choices Stats: 2025-12-04T10:01:24.5023449Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5023741Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5023977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5024336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5025651Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5026925Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5028309Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5029586Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5030863Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5032137Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5032463Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.5032533Z Autotune Choices Stats: 2025-12-04T10:01:24.5034189Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5034749Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5035111Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5035764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5037129Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5038452Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5039780Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5041099Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.5042450Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5043791Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5045142Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5046459Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5047806Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5049117Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5049398Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.5049537Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5049606Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5049668Z unimplemented [] 2025-12-04T10:01:24.5049780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5049986Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5051382Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5051483Z graph_break [] 2025-12-04T10:01:24.5051617Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5051689Z Autotune Choices Stats: 2025-12-04T10:01:24.5053310Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.5053601Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5053875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5054233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5055918Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5057310Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5058579Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5059854Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5061127Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5062496Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5062832Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.5062903Z Autotune Choices Stats: 2025-12-04T10:01:24.5064618Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.5065151Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:24.5065518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5066203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5067626Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5068955Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5070286Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5071605Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5072996Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5074345Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5075674Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5077026Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5078342Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5079662Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:24.5079949Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.5080096Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5080171Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5080236Z unimplemented [] 2025-12-04T10:01:24.5080388Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5080604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5082001Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5082064Z graph_break [] 2025-12-04T10:01:24.5082234Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5082312Z Autotune Choices Stats: 2025-12-04T10:01:24.5083935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5084232Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5084475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5084834Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5086164Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5087441Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5088725Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5090000Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5091304Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5092616Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5092903Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.5092974Z Autotune Choices Stats: 2025-12-04T10:01:24.5094641Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5095205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5095575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5096227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5097561Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5098894Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5100212Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5101615Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5102979Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.5104314Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5105674Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5106998Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5108368Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5109693Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5110014Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.5110155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5110227Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5110291Z unimplemented [] 2025-12-04T10:01:24.5110402Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5110610Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5112043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5112110Z graph_break [] 2025-12-04T10:01:24.5112243Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:24.5112318Z Autotune Choices Stats: 2025-12-04T10:01:24.5113950Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5114283Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5114529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5114899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5116195Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5117462Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5118733Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5120015Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5121362Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5122680Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5122966Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.5123034Z Autotune Choices Stats: 2025-12-04T10:01:24.5124668Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.5125233Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5125595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5126244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:24.5127584Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5128901Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5130262Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5131619Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5132985Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5134302Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5135655Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5136978Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5138290Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5139608Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5139940Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.5140087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5140162Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5140228Z unimplemented [] 2025-12-04T10:01:24.5140381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5140591Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5142028Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5142094Z graph_break [] 2025-12-04T10:01:24.5142230Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5142308Z Autotune Choices Stats: 2025-12-04T10:01:24.5143917Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5144250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5144498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5144859Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5146142Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5147454Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5148726Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5150075Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:24.5151381Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5152660Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5152938Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.5153048Z Autotune Choices Stats: 2025-12-04T10:01:24.5154694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5155387Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5155759Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5156413Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5157747Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5159070Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5160526Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5161911Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5163236Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5164598Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.5165920Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5167237Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5168554Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5169961Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5170248Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.5170399Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5170473Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5170538Z unimplemented [] 2025-12-04T10:01:24.5170655Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5170898Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5172303Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5172367Z graph_break [] 2025-12-04T10:01:24.5172567Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5172648Z Autotune Choices Stats: 2025-12-04T10:01:24.5174232Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5174533Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5174783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5175146Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5176427Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5177704Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5179016Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5180324Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5181624Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5182906Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5183222Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.5183298Z Autotune Choices Stats: 
2025-12-04T10:01:24.5184925Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5185450Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5185810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5186461Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5187848Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.5189258Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5190611Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5191929Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5193282Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5194602Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5195927Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5197238Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5198596Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5199951Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5200236Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.5200411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5200484Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5200547Z unimplemented [] 2025-12-04T10:01:24.5200662Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5200868Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5202267Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5202372Z graph_break [] 2025-12-04T10:01:24.5202510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5202583Z Autotune Choices Stats: 2025-12-04T10:01:24.5204172Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5204475Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5204716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5205072Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5206370Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5207683Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5208988Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5210321Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5211593Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.5212905Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5213179Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.5213255Z Autotune Choices Stats: 2025-12-04T10:01:24.5214888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5215416Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5215777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5216428Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5217795Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5219152Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5220506Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5221831Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5223189Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5224508Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5225834Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.5227151Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5228592Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5229949Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5230234Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.5230377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5230448Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5230513Z 
unimplemented [] 2025-12-04T10:01:24.5230658Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5230867Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5232257Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5232329Z graph_break [] 2025-12-04T10:01:24.5232464Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5232545Z Autotune Choices Stats: 2025-12-04T10:01:24.5234129Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5234423Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5234661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5235018Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5236304Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5237647Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5238953Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5240234Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5241545Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5242825Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5243105Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.5243180Z Autotune Choices Stats: 2025-12-04T10:01:24.5244813Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.5245336Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5245737Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5246392Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5247776Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5249140Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.5250455Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5251809Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5253127Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5254451Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5257863Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5259364Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5260738Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5262076Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5262410Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.5262552Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5262631Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5262700Z unimplemented [] 2025-12-04T10:01:24.5262814Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5263019Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5264418Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5264487Z graph_break [] 2025-12-04T10:01:24.5264628Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5264702Z Autotune Choices Stats: 2025-12-04T10:01:24.5266321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5266619Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5266860Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5267319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5268669Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5269982Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5271259Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5272538Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5273846Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5275119Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5275399Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.5275476Z Autotune Choices Stats: 2025-12-04T10:01:24.5277118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5277685Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5278049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5278726Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5280093Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5281416Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5282768Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5284087Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5285400Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5286731Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5288553Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5292848Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.5297338Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5301695Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5304374Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.5305204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5305709Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5306061Z unimplemented [] 2025-12-04T10:01:24.5306415Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5307003Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5309751Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5312220Z graph_break [] 2025-12-04T10:01:24.5312627Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5313136Z Autotune Choices Stats: 2025-12-04T10:01:24.5315344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5317637Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5318266Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5318958Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5320790Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5323487Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5326132Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5328810Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5331452Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5334107Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5335770Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.5336220Z Autotune Choices Stats: 2025-12-04T10:01:24.5337983Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.5340308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5341282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5342414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5344489Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5347387Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5350131Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5352878Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5355796Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5358635Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5361428Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5364204Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5366936Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5369740Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5371434Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.5371965Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.5372280Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5372492Z unimplemented [] 2025-12-04T10:01:24.5372707Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5373119Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5374811Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5376482Z graph_break [] 2025-12-04T10:01:24.5376777Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5377130Z Autotune Choices Stats: 2025-12-04T10:01:24.5379009Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5381019Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5381673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5382370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5384148Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5386805Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5389564Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5392231Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5394903Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5397562Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5399259Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.5399702Z Autotune Choices Stats: 2025-12-04T10:01:24.5401508Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.5403761Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5404763Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5405863Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5407926Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5410706Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5413439Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5416170Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5418914Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5421898Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5424662Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5427447Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5430214Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5432946Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5434642Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.5435159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5435462Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5435673Z unimplemented [] 2025-12-04T10:01:24.5435892Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5436294Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5437991Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5439590Z graph_break [] 2025-12-04T10:01:24.5439829Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5440132Z Autotune Choices Stats: 2025-12-04T10:01:24.5441886Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5443851Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5444468Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5445213Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5446972Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5449660Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.5452313Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5454966Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5457786Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5460527Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5462191Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.5462632Z Autotune Choices Stats: 2025-12-04T10:01:24.5464488Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.5466750Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5467810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5468919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5471648Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5474397Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5477161Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5479898Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5482738Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5485499Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5488233Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5490962Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5493733Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5496485Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5498182Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.5498698Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5499000Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5499207Z unimplemented [] 2025-12-04T10:01:24.5499423Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5499819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5501518Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5503089Z graph_break [] 2025-12-04T10:01:24.5503325Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5503628Z Autotune Choices Stats: 2025-12-04T10:01:24.5505414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5507449Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5508072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5508759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5510518Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5513219Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5515860Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5518500Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5521150Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5523898Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5525665Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.5526105Z Autotune Choices Stats: 2025-12-04T10:01:24.5527895Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5530137Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5531145Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5532247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5534324Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5537090Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5539828Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5542600Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5545356Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5548201Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5550944Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5553713Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5556606Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5559338Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5561037Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.5561555Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5561945Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5562149Z unimplemented [] 2025-12-04T10:01:24.5562367Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5562781Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5564527Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5566071Z graph_break [] 2025-12-04T10:01:24.5566309Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5566614Z Autotune Choices Stats: 2025-12-04T10:01:24.5568388Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5570362Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5570972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5571711Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5573450Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5576099Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5578747Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5581383Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5584083Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5586764Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5588514Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.5588955Z Autotune Choices Stats: 2025-12-04T10:01:24.5590711Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5593003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5593970Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5595063Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5597135Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5599870Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5602610Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5605435Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5608222Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5610957Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5613725Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5616472Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5619189Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5621915Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5623645Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.5624147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5624453Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5624670Z unimplemented [] 2025-12-04T10:01:24.5624881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5625285Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5627015Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5628629Z graph_break [] 2025-12-04T10:01:24.5628865Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5629159Z Autotune Choices Stats: 2025-12-04T10:01:24.5630874Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.5632871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5633510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5634195Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5635956Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5638611Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5641258Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5643952Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5646640Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5649324Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5650986Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.5651412Z Autotune Choices Stats: 2025-12-04T10:01:24.5653189Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.5655628Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5656609Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5657703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5659766Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5662516Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5665392Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5668646Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5671391Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5674192Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5677207Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5680110Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5682834Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5685614Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.5687308Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.5687848Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5688147Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5688357Z unimplemented [] 2025-12-04T10:01:24.5688567Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5688969Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5690695Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5692242Z graph_break [] 2025-12-04T10:01:24.5692479Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5692781Z Autotune Choices Stats: 2025-12-04T10:01:24.5694503Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.5696512Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5697126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5697813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5699554Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5702201Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5704845Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5707646Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5710328Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5712971Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5714660Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.5715098Z Autotune Choices Stats: 2025-12-04T10:01:24.5716857Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.5719103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5720064Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5721160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5723221Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5726013Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5728795Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5731586Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5734310Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5737090Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5739824Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5742557Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5745283Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5748146Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5749836Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.5750357Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5750669Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5750937Z unimplemented [] 2025-12-04T10:01:24.5751149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5751547Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5753234Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5754814Z graph_break [] 2025-12-04T10:01:24.5755041Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5755505Z Autotune Choices Stats: 2025-12-04T10:01:24.5757252Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.5759218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5759840Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5760523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5762282Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5764936Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5767707Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5770401Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5773070Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5775793Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5777450Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.5783073Z Autotune Choices Stats: 2025-12-04T10:01:24.5784933Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.5787301Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5788283Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5789398Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5791583Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5794370Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5797155Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5799898Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5802679Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5805410Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5808161Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5810890Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5813692Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5816470Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5818175Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.5818695Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5819003Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5819217Z unimplemented [] 2025-12-04T10:01:24.5819430Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5819845Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5821580Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5823123Z graph_break [] 2025-12-04T10:01:24.5823360Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5823656Z Autotune Choices Stats: 2025-12-04T10:01:24.5825395Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.5827430Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5828055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5828743Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5830490Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5833194Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5835904Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5838593Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5841237Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5843913Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5845563Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.5845999Z Autotune Choices Stats: 2025-12-04T10:01:24.5847777Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.5850023Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5850988Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5852129Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5854242Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5857279Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5860034Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5862857Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5865617Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5868427Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5871177Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5873982Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5876778Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5879555Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5881256Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.5881773Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5882113Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5882325Z unimplemented [] 2025-12-04T10:01:24.5882562Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5882972Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5884673Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5886205Z graph_break [] 2025-12-04T10:01:24.5886438Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5886738Z Autotune Choices Stats: 2025-12-04T10:01:24.5888459Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5890421Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5891037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5891735Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5893529Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5896217Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5898908Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5901564Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5904253Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5906916Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5908621Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.5909071Z Autotune Choices Stats: 2025-12-04T10:01:24.5910833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.5913148Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5914124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5915225Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5917346Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5920135Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5922868Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5925651Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5928399Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5931156Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5933899Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5936726Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5939502Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5942236Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5943966Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.5944484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.5944788Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.5944989Z unimplemented [] 2025-12-04T10:01:24.5945208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.5945608Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.5947366Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.5948909Z graph_break [] 2025-12-04T10:01:24.5949140Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.5949440Z Autotune Choices Stats: 2025-12-04T10:01:24.5951171Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.5953187Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5953806Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5954484Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5956467Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5959188Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5961844Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.5964536Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5967185Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.5969838Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5971488Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.5971921Z 
Autotune Choices Stats: 2025-12-04T10:01:24.5973694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.5976005Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.5977015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.5978116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.5980222Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.5982973Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5985747Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5988560Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.5991307Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.5994044Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.5996860Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.5999631Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6002365Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6005131Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6006830Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.6007337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6007658Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6007866Z unimplemented [] 2025-12-04T10:01:24.6008080Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6008476Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6010170Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6011697Z graph_break [] 2025-12-04T10:01:24.6011929Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6012220Z Autotune Choices Stats: 2025-12-04T10:01:24.6013946Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6015976Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6016602Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6017335Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6019121Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6021795Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6024482Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6027137Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6029843Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.6032502Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6034156Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.6034630Z Autotune Choices Stats: 2025-12-04T10:01:24.6036444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.6038692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6039658Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6040793Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6042873Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6045674Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6048419Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6051166Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6053917Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6056847Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6059659Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.6062471Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6065210Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6068054Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6069742Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.6070259Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6070564Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6070762Z 
unimplemented [] 2025-12-04T10:01:24.6070978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6071385Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6073064Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6074598Z graph_break [] 2025-12-04T10:01:24.6074832Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6075195Z Autotune Choices Stats: 2025-12-04T10:01:24.6076922Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.6078929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6079546Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6080235Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6082010Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6084670Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6087367Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6090026Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6092692Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6095349Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6097048Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.6097481Z Autotune Choices Stats: 2025-12-04T10:01:24.6099284Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.6101567Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6102531Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6103626Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6105750Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6108566Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.6111343Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6114081Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6116859Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6119629Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6122402Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6125139Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6127909Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6130643Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6132341Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.6132851Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6133151Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6133355Z unimplemented [] 2025-12-04T10:01:24.6133573Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6133976Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6135664Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6137245Z graph_break [] 2025-12-04T10:01:24.6137474Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6137773Z Autotune Choices Stats: 2025-12-04T10:01:24.6139559Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6141533Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6142181Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6142862Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6144604Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6147367Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6150032Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6152684Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6155489Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6158232Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6159933Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.6160372Z Autotune Choices Stats: 2025-12-04T10:01:24.6162196Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.6164459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6165425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6166579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6168647Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6171406Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6174144Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6176891Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6178290Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6179651Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6180978Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6182345Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.6183677Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6185012Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6185303Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.6185442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6185513Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6185585Z unimplemented [] 2025-12-04T10:01:24.6185693Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6185942Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6187383Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6187447Z graph_break [] 2025-12-04T10:01:24.6187626Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6187699Z Autotune Choices Stats: 2025-12-04T10:01:24.6189344Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6189632Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6189880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6190236Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6191568Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6192854Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6194143Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6195430Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6196767Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6198077Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6198365Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.6198432Z Autotune Choices Stats: 2025-12-04T10:01:24.6200111Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.6200662Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6201037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6201680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6203025Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6204348Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6205677Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6207095Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6208448Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6209780Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6211101Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6212466Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6213793Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6215123Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6215445Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.6215580Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.6215650Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6215721Z unimplemented [] 2025-12-04T10:01:24.6215827Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6216029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6217463Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6217532Z graph_break [] 2025-12-04T10:01:24.6217671Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6217739Z Autotune Choices Stats: 2025-12-04T10:01:24.6219385Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6219703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6219954Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6220317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6221613Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6222900Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6224182Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6225464Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6226822Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6228199Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6228487Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.6228555Z Autotune Choices Stats: 2025-12-04T10:01:24.6230209Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.6230785Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6231159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6231796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6233144Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6234480Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6235851Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6237208Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6238564Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6239909Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6241274Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6242606Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6243938Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6245261Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6245585Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.6245720Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6245790Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6245858Z unimplemented [] 2025-12-04T10:01:24.6246001Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6246208Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6247650Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6247714Z graph_break [] 2025-12-04T10:01:24.6247853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6247922Z Autotune Choices Stats: 2025-12-04T10:01:24.6249525Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.6249858Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6250103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6250455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6251744Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6253036Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.6254316Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6255882Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6257254Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6258548Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6258832Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.6258948Z Autotune Choices Stats: 2025-12-04T10:01:24.6260605Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.6261128Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6261498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6262148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6263506Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6264843Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6266259Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6267717Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6269049Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6270431Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6271747Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6273084Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6274413Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6275784Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6276101Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.6276243Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6276318Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6276387Z unimplemented [] 2025-12-04T10:01:24.6276494Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6276732Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6278136Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6278209Z graph_break [] 2025-12-04T10:01:24.6278350Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6278451Z Autotune Choices Stats: 2025-12-04T10:01:24.6280054Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6280338Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6280582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6280946Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6282243Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6283533Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6284868Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6286180Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6287488Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6288779Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6289103Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:24.6289176Z Autotune Choices Stats: 2025-12-04T10:01:24.6290833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.6291353Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6291723Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6292370Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6293720Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6295119Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6296498Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6297828Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6299193Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6300521Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6301845Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6303173Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6304528Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6305911Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6306209Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:24.6306380Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6306452Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6306521Z unimplemented [] 2025-12-04T10:01:24.6306627Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6306829Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6308276Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6308389Z graph_break [] 2025-12-04T10:01:24.6308535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6308604Z Autotune Choices Stats: 2025-12-04T10:01:24.6310233Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.6310521Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6310762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6311128Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6312426Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6313752Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6315072Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6316388Z 
triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6317670Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6318978Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6319261Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:24.6319332Z Autotune Choices Stats: 2025-12-04T10:01:24.6320980Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.6321504Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6321870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6322519Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6323904Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6325542Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6326934Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6328276Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6329643Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6330972Z triton_flex_attention_backward_735 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6332300Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6333638Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6335031Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6336403Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6336686Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:24.6336828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6336897Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6336975Z unimplemented [] 2025-12-04T10:01:24.6337083Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6337324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6338713Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6338776Z graph_break [] 2025-12-04T10:01:24.6338918Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6338986Z Autotune Choices Stats: 2025-12-04T10:01:24.6340597Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.6340886Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6341125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6341484Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6342774Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6344182Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6345507Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6346787Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6348142Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6349434Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6349717Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:24.6349788Z Autotune Choices Stats: 2025-12-04T10:01:24.6351441Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6351958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6352376Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6353021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6354395Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6355952Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6357293Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6358682Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6360012Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6361346Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6362671Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6364115Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:24.6365487Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:24.6366828Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:24.6367141Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices
2025-12-04T10:01:24.6367342Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:24.6367420Z Traceback (most recent call last):
2025-12-04T10:01:24.6367778Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:24.6367845Z self.assertTrue(
2025-12-04T10:01:24.6368074Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:24.6368162Z raise self.failureException(msg)
2025-12-04T10:01:24.6368440Z AssertionError: False is not true : Log file /tmp/tmpdbsr3fy2/flex_attention_configs.json was not created
2025-12-04T10:01:24.6368445Z
2025-12-04T10:01:24.6368598Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:24.6368891Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:24.6368896Z
2025-12-04T10:01:24.6369073Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:24.6371126Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:24.6371247Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:24.6371315Z unimplemented []
2025-12-04T10:01:24.6371439Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:24.6372854Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:24.6373068Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:24.6373190Z graph_break []
2025-12-04T10:01:24.6373340Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:24.6374535Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class.
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.6374619Z current_size = base.storage().size() 2025-12-04T10:01:24.6374688Z Autotune Choices Stats: 2025-12-04T10:01:24.6376343Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.6376636Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6376892Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6377254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6378566Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6379899Z triton_flex_attention_5 0.0102 ms 
91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6381269Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6382544Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6383831Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.6385192Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6385481Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.6385559Z Autotune Choices Stats: 2025-12-04T10:01:24.6387200Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.6387783Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6388188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6388837Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6390175Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6391561Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6392877Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6394234Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6395585Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6396913Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6398240Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:24.6399590Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6400938Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6402260Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6402549Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.6402689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6402809Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6402881Z unimplemented [] 
2025-12-04T10:01:24.6402994Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6403203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6404629Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6404696Z graph_break [] 2025-12-04T10:01:24.6404836Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6404905Z Autotune Choices Stats: 2025-12-04T10:01:24.6406502Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6406793Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6407037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6407428Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6408735Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6410011Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6411337Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6412613Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6413921Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6415251Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6415533Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.6415599Z Autotune Choices Stats: 2025-12-04T10:01:24.6417246Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6417813Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6418176Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6418817Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6420203Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6421533Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6422850Z 
triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6424237Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6425551Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6426880Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6428301Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6429626Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6430993Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6432300Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6432623Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.6432759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6432832Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6432894Z unimplemented [] 2025-12-04T10:01:24.6432999Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6433208Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6434642Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6434713Z graph_break [] 2025-12-04T10:01:24.6434846Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6434913Z Autotune Choices Stats: 2025-12-04T10:01:24.6436510Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6436838Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6437084Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6437438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6438731Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6440047Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6441325Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6442627Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6443937Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6445206Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6445479Z SingleProcess AUTOTUNE benchmarking takes 0.2908 
seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:24.6445543Z Autotune Choices Stats: 2025-12-04T10:01:24.6447180Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6447737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6448102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6448742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6450115Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6451464Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6452870Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6454192Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6455697Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6457036Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6458423Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6459797Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6461125Z triton_flex_attention_backward_56 0.0142 ms 
79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6462435Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6462770Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.6462906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6463024Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6463098Z unimplemented [] 2025-12-04T10:01:24.6463202Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6463412Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6464794Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6464857Z graph_break [] 2025-12-04T10:01:24.6464988Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6465053Z Autotune Choices Stats: 2025-12-04T10:01:24.6466660Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6466984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6467281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6467639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6468968Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6470252Z 
triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6471541Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6472887Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6474171Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6475456Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6475778Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.6475856Z Autotune Choices Stats: 2025-12-04T10:01:24.6477491Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6478000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6478406Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6479047Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6480387Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6481746Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6483100Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6484437Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6485763Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6487125Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6488494Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6489820Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6491141Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6492525Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6492809Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.6492943Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6493013Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:24.6493083Z unimplemented [] 2025-12-04T10:01:24.6493186Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6493391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6494776Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6494875Z graph_break [] 2025-12-04T10:01:24.6495019Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6495089Z Autotune Choices Stats: 2025-12-04T10:01:24.6496702Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6496983Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6497228Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6497623Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6498915Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6500192Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6501505Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6502823Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6504100Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6505380Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6505694Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.6505768Z Autotune Choices Stats: 2025-12-04T10:01:24.6507444Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6508003Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6508371Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6509027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6510370Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6511762Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:24.6513094Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6514421Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6515770Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6517102Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6518455Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6519780Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6521134Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6522485Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6522771Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.6522906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6522974Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6523041Z unimplemented [] 2025-12-04T10:01:24.6523155Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6523368Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6524759Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6524879Z graph_break [] 2025-12-04T10:01:24.6525019Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6525086Z Autotune Choices Stats: 2025-12-04T10:01:24.6526691Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6527014Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6527257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6527609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6528904Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6530212Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6531520Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6532803Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6534080Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6535389Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6535676Z 
SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.6535750Z Autotune Choices Stats: 2025-12-04T10:01:24.6537446Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6537965Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6538332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6539009Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6540395Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6541723Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6543053Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6544386Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6545740Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6547097Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6548457Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6549827Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.6551190Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6552513Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6552800Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.6552939Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6553008Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6553112Z unimplemented [] 2025-12-04T10:01:24.6553218Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6553426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6554811Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6554873Z graph_break [] 2025-12-04T10:01:24.6555012Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6555076Z Autotune Choices Stats: 2025-12-04T10:01:24.6556889Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6557182Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6557426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6557780Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6559082Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6560469Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6561752Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6563024Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6564360Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6565651Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6565938Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.6566042Z Autotune Choices Stats: 2025-12-04T10:01:24.6567684Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6568202Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6568608Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6569249Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6570620Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6571948Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6573281Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6574646Z 
triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6576000Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6577327Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6578645Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6580033Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6581361Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6582689Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6583017Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.6583155Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.6583224Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6583295Z unimplemented [] 2025-12-04T10:01:24.6583400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6583604Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6584992Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6585055Z graph_break [] 2025-12-04T10:01:24.6585192Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6585296Z Autotune Choices Stats: 2025-12-04T10:01:24.6586907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.6587189Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6587512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6587866Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6589196Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6590481Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6591771Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6593082Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6594368Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6595683Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6595978Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.6596048Z Autotune Choices Stats: 2025-12-04T10:01:24.6597699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6598265Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6598667Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6599308Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6600653Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6601975Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6603337Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6604665Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6606092Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6607412Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6608764Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6610121Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6611457Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6612776Z 
triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6613097Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.6613231Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6613302Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6613369Z unimplemented [] 2025-12-04T10:01:24.6613484Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6613687Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6615113Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6615177Z graph_break [] 2025-12-04T10:01:24.6615315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6615379Z Autotune Choices Stats: 2025-12-04T10:01:24.6616992Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.6617310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6617551Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6617904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6619228Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6620505Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.6621781Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6623096Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6624410Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6625686Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6625964Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.6626032Z Autotune Choices Stats: 2025-12-04T10:01:24.6627733Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6628323Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6628688Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6629333Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6630679Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6632031Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6633358Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6634731Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6636054Z 
triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6637420Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6638776Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6640108Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6641428Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6642787Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6643068Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.6643202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6643272Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6643341Z unimplemented [] 2025-12-04T10:01:24.6643444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6643682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6645073Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6645134Z graph_break [] 2025-12-04T10:01:24.6645272Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6645339Z Autotune Choices Stats: 2025-12-04T10:01:24.6646975Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6647253Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6647538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6647897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6649188Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6650482Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6651806Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6653086Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6654399Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6655852Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6656209Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.6656278Z Autotune Choices Stats: 2025-12-04T10:01:24.6657971Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6658491Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6658859Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6659504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6660847Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6662229Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6663608Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6664937Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6666267Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6667714Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6669072Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6670407Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6671782Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6673112Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6673435Z SingleProcess AUTOTUNE 
benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:24.6673577Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6673647Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6673717Z unimplemented [] 2025-12-04T10:01:24.6673820Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6674023Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6675417Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6675519Z graph_break [] 2025-12-04T10:01:24.6675659Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6675726Z Autotune Choices Stats: 2025-12-04T10:01:24.6677372Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.6677661Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6677905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6678259Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6679554Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6680870Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6682157Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6683469Z 
triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6684745Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6686056Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6686341Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.6686406Z Autotune Choices Stats: 2025-12-04T10:01:24.6688085Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6688603Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6688968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6689644Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6690985Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6692358Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6693700Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6695022Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6696432Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6697749Z triton_flex_attention_backward_200 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6699065Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6700428Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6701748Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6703106Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6703392Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.6703526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6703599Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6703666Z unimplemented [] 2025-12-04T10:01:24.6703768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6703971Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6705403Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6705463Z graph_break [] 2025-12-04T10:01:24.6705603Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6705669Z Autotune Choices Stats: 2025-12-04T10:01:24.6707380Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6707669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6707907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6708266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6709559Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6710884Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6712212Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6713495Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6714769Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6716120Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6716403Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.6716470Z Autotune Choices Stats: 2025-12-04T10:01:24.6718118Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6718636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6719041Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6719688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6721044Z triton_flex_attention_backward_215 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6722406Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6723741Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6725103Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6726474Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6727798Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6729117Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6730481Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6731842Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6733172Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6733455Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.6733630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6733699Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6733767Z unimplemented [] 2025-12-04T10:01:24.6733871Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6734071Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6735494Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6735559Z graph_break [] 2025-12-04T10:01:24.6735697Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6735766Z Autotune Choices Stats: 2025-12-04T10:01:24.6737377Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.6737658Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6737930Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6738292Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6739577Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6740895Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6742174Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6743455Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6744826Z triton_flex_attention_231 
0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6746106Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6746391Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.6746457Z Autotune Choices Stats: 2025-12-04T10:01:24.6748198Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6748756Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6749122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6749769Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6751147Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6752478Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6753810Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6755368Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6756731Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6758063Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6759493Z triton_flex_attention_backward_238 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6760824Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6762189Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6763523Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6763851Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.6763991Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6764060Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6764127Z unimplemented [] 2025-12-04T10:01:24.6764229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6764475Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6765875Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6765938Z graph_break [] 2025-12-04T10:01:24.6766081Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6766152Z Autotune Choices Stats: 2025-12-04T10:01:24.6767748Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6768078Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6768314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6768673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6769965Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6771283Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6772566Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6773885Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6775204Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6776486Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6776769Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.6776834Z Autotune Choices Stats: 2025-12-04T10:01:24.6778517Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6779035Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6779402Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6780093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6781438Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6782767Z triton_flex_attention_backward_255 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6784159Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6785481Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6786815Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6788211Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6789534Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6790907Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6792236Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6793593Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6793871Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.6794043Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6794115Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6794183Z unimplemented [] 2025-12-04T10:01:24.6794286Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6794489Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6795890Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6795950Z graph_break [] 2025-12-04T10:01:24.6796089Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6796155Z Autotune Choices Stats: 2025-12-04T10:01:24.6797791Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6798083Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6798330Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6798692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6800020Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6801307Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6802612Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6803931Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6805219Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6806505Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6806833Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.6806900Z Autotune Choices Stats: 2025-12-04T10:01:24.6808558Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6809072Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6809476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6810116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6811461Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6812832Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6814191Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6815522Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6816854Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6818232Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6819721Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6821060Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6822385Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6823780Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6824070Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.6824218Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6824289Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6824357Z unimplemented [] 2025-12-04T10:01:24.6824462Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6824664Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6826068Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6826166Z graph_break [] 2025-12-04T10:01:24.6826307Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6826374Z Autotune Choices Stats: 2025-12-04T10:01:24.6828008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6828310Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6828590Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6828950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6830244Z 
triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6831530Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6833061Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6834350Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6835633Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6836964Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6837244Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.6837308Z Autotune Choices Stats: 2025-12-04T10:01:24.6838997Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6839516Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6839883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6840520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6841902Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6843260Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6844600Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6845931Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6847297Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6848668Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.6849992Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6851320Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6852719Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6854049Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6854334Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.6854474Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6854543Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6854605Z unimplemented [] 2025-12-04T10:01:24.6854719Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6854925Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6857061Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6857152Z graph_break [] 2025-12-04T10:01:24.6857306Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6857378Z Autotune Choices Stats: 2025-12-04T10:01:24.6859092Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6859399Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6859645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6860006Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6861327Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6862732Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6864019Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6865313Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6866581Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6868002Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6868295Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.6868366Z Autotune Choices Stats: 
2025-12-04T10:01:24.6870063Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6870591Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6870960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6871651Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6873028Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.6874352Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6875677Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6877031Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6878360Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6879714Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6881039Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6882412Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6883763Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6885093Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6885375Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.6885557Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6885631Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6885695Z unimplemented [] 2025-12-04T10:01:24.6885811Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6886028Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6887426Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6887489Z graph_break [] 2025-12-04T10:01:24.6887631Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6887700Z Autotune Choices Stats: 2025-12-04T10:01:24.6889359Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6889653Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6889896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6890265Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6891605Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6892958Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6894260Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6895574Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6896921Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.6898244Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6898531Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.6898599Z Autotune Choices Stats: 2025-12-04T10:01:24.6900255Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6900815Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6901182Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6901866Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6903204Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6904558Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6905938Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6907312Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6908691Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6910024Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6911392Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.6912760Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6914096Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6915431Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6915747Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.6915890Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6915960Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6916021Z 
unimplemented [] 2025-12-04T10:01:24.6916132Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6916337Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6917766Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6917831Z graph_break [] 2025-12-04T10:01:24.6917966Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6918038Z Autotune Choices Stats: 2025-12-04T10:01:24.6919654Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6919989Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6920230Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6920590Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6922018Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6923556Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6924924Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6926241Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6927534Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6928915Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6929268Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.6929354Z Autotune Choices Stats: 2025-12-04T10:01:24.6931163Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6931739Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6932134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6932785Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6934123Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6935471Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.6936849Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6938206Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6939538Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6940872Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6942274Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6943604Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6944927Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6946295Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6946573Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.6946716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6946784Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6946846Z unimplemented [] 2025-12-04T10:01:24.6946958Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6947160Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6948656Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6948721Z graph_break [] 2025-12-04T10:01:24.6953078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6953196Z Autotune Choices Stats: 2025-12-04T10:01:24.6954834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6955404Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6955707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6956158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6957470Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6958757Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6960105Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6961409Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6962753Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6964051Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6964349Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.6964470Z Autotune Choices Stats: 2025-12-04T10:01:24.6966170Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.6966704Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6967077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6967732Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6969079Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6970462Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6971798Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6973187Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.6974518Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.6975919Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6977257Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.6978585Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.6979921Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6981307Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.6981593Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.6981748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.6981858Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.6981926Z unimplemented [] 2025-12-04T10:01:24.6982044Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.6982254Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.6983685Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.6983751Z graph_break [] 2025-12-04T10:01:24.6983932Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.6984006Z Autotune Choices Stats: 2025-12-04T10:01:24.6985653Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.6985959Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6986201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6986575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6987963Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6989264Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6990604Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6991947Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.6993244Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.6994535Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.6994858Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.6994933Z Autotune Choices Stats: 2025-12-04T10:01:24.6996646Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.6997189Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.6997552Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.6998219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.6999602Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7000963Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7002339Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7003671Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7005045Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7006418Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7007758Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7009092Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7010473Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7011873Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7012165Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.7012314Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.7012387Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7012452Z unimplemented [] 2025-12-04T10:01:24.7012569Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7012779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7014194Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7014296Z graph_break [] 2025-12-04T10:01:24.7014431Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7014509Z Autotune Choices Stats: 2025-12-04T10:01:24.7016156Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7016463Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7016703Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7017080Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7018383Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7019714Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7021040Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7022338Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7023632Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7024962Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7025278Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.7025358Z Autotune Choices Stats: 2025-12-04T10:01:24.7027027Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.7027619Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7027987Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7028692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7030039Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7031416Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7032759Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7034086Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7035492Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7036839Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7038173Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7039544Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7040884Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7042264Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7042556Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.7042699Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7042775Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7042843Z unimplemented [] 2025-12-04T10:01:24.7042993Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7043203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7044618Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7044740Z graph_break [] 2025-12-04T10:01:24.7044878Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7044954Z Autotune Choices Stats: 2025-12-04T10:01:24.7046568Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7046862Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7047104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7047514Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7048815Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7050111Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7051428Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7052724Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7054059Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7055669Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7055994Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.7056072Z Autotune Choices Stats: 2025-12-04T10:01:24.7057731Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.7058321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7058690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7059350Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7060769Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7062125Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7063457Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7064883Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7066221Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7067618Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7069001Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7070344Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7071726Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7073076Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7073404Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.7073551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7073624Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7073690Z unimplemented [] 2025-12-04T10:01:24.7073803Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7074010Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7075472Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7075544Z graph_break [] 2025-12-04T10:01:24.7075681Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7075758Z Autotune Choices Stats: 2025-12-04T10:01:24.7077383Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7077712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7077958Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7078328Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7079644Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7080979Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7082275Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7083580Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7084958Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7086274Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7086560Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.7086636Z Autotune Choices Stats: 2025-12-04T10:01:24.7088297Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7088863Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7089226Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7089883Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7091271Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7092630Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7094003Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7095375Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7096730Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7098070Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7099451Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7100817Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7102185Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7103525Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7103848Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.7103984Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7104065Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7104165Z unimplemented [] 2025-12-04T10:01:24.7104280Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7104486Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7105906Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7105979Z graph_break [] 2025-12-04T10:01:24.7106114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7106191Z Autotune Choices Stats: 2025-12-04T10:01:24.7107851Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7108190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7108429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7108792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7110137Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7111442Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7112732Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7114089Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7115374Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7116662Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7116978Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.7117054Z Autotune Choices Stats: 2025-12-04T10:01:24.7118708Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7119246Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7119664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7120329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7121678Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7123061Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7124434Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7125777Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7127112Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7128475Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7129852Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7131177Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7132511Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7133916Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7134202Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.7134341Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7134427Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7134491Z unimplemented [] 2025-12-04T10:01:24.7134604Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7134807Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7136227Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7136296Z graph_break [] 2025-12-04T10:01:24.7136469Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7136543Z Autotune Choices Stats: 2025-12-04T10:01:24.7138152Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7138446Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7138685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7139048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7140379Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7141667Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7142994Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7144328Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7145614Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7146916Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7147293Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.7147369Z Autotune Choices Stats: 2025-12-04T10:01:24.7149025Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.7149589Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7149955Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7150615Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7151955Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7153368Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7154696Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7156250Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7157706Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7159046Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7160440Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7161775Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7163169Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7164552Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7164846Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.7164985Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7165062Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7165129Z unimplemented [] 2025-12-04T10:01:24.7165243Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7165449Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7166878Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7167001Z graph_break [] 2025-12-04T10:01:24.7167138Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7167213Z Autotune Choices Stats: 2025-12-04T10:01:24.7168817Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7169157Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7169400Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7169760Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7171072Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7172392Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7173728Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7175018Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7176316Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7177647Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7177929Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.7178004Z Autotune Choices Stats: 2025-12-04T10:01:24.7179692Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7180220Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7180592Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7181256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7182639Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7184020Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7185367Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7186703Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7188133Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7189499Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7190836Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7192168Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7193596Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7194933Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7195219Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.7195358Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7195436Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7195502Z unimplemented [] 2025-12-04T10:01:24.7195650Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7195866Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7197277Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7197348Z graph_break [] 2025-12-04T10:01:24.7197485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7197560Z Autotune Choices Stats: 2025-12-04T10:01:24.7199203Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7199501Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7199753Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7200114Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7201422Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7202782Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7204081Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7205368Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7206698Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7207996Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7208279Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.7208398Z Autotune Choices Stats: 2025-12-04T10:01:24.7210068Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.7210595Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7210996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7211653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7213042Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7214384Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7215730Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7217098Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7218479Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7219810Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7221144Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7222547Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7223887Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7225219Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7225550Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.7225687Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7225766Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7225831Z unimplemented [] 2025-12-04T10:01:24.7225943Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7226154Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7227616Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7227689Z graph_break [] 2025-12-04T10:01:24.7227824Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7227941Z Autotune Choices Stats: 2025-12-04T10:01:24.7229539Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7229829Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7230069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7230485Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7231826Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7233124Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7234419Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7235706Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7237041Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7238368Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7238656Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.7238730Z Autotune Choices Stats: 2025-12-04T10:01:24.7240392Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.7240954Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7241319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7242007Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7243360Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7244703Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7246098Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7247431Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7248805Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7250141Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7251505Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7252868Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7254203Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7255736Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7256110Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.7256255Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7256332Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7256396Z unimplemented [] 2025-12-04T10:01:24.7256502Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7256717Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7258186Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7258259Z graph_break [] 2025-12-04T10:01:24.7258398Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7258465Z Autotune Choices Stats: 2025-12-04T10:01:24.7260097Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7260436Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7260678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7261047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7262405Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7263698Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7264988Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7266310Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7267698Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7268993Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7269273Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.7269352Z Autotune Choices Stats: 2025-12-04T10:01:24.7270999Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.7271588Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7271951Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7272610Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7273951Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7275332Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7276674Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7278066Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7279402Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7280774Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7282148Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7283481Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7284811Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7286195Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7286491Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.7286626Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7286703Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7286770Z unimplemented [] 2025-12-04T10:01:24.7286873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7287121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7288527Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7288599Z graph_break [] 2025-12-04T10:01:24.7288733Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7288803Z Autotune Choices Stats: 2025-12-04T10:01:24.7290414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7290737Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7291022Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7291385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7292687Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7293972Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7295302Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7296607Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7297925Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7299222Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7299540Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.7299614Z 
Autotune Choices Stats: 2025-12-04T10:01:24.7301305Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.7301839Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7302202Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7302854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7304199Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.7305600Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7306965Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7308380Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7309713Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7311120Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7312457Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7313779Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7315151Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7316484Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7316819Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.7316958Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7317035Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7317102Z unimplemented [] 2025-12-04T10:01:24.7317208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7317419Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7318826Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7318934Z graph_break [] 2025-12-04T10:01:24.7319070Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7319139Z Autotune Choices Stats: 2025-12-04T10:01:24.7320786Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7321076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7321324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7321684Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7322991Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7324331Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7325635Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7326975Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7328256Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.7329591Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7329873Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.7329947Z Autotune Choices Stats: 2025-12-04T10:01:24.7331639Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.7332165Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7332527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7333222Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7334567Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7335946Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7337278Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7338615Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7340032Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7341367Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7342702Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.7344032Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7345406Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7346775Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7347062Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.7347194Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7347347Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7347415Z 
unimplemented [] 2025-12-04T10:01:24.7347523Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7347737Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7349183Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7349253Z graph_break [] 2025-12-04T10:01:24.7349385Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7349452Z Autotune Choices Stats: 2025-12-04T10:01:24.7351100Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7351386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7351626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7351984Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7353284Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7354606Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7356106Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7357430Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7358731Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7360114Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7360406Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.7360482Z Autotune Choices Stats: 2025-12-04T10:01:24.7362132Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7362655Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7363079Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7363739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7365098Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7366491Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7367836Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7369212Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7370570Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7371914Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7373264Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7374839Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7376223Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7377551Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7377839Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.7377981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7378113Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7378191Z unimplemented [] 2025-12-04T10:01:24.7378302Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7378510Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7379966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7380038Z graph_break [] 2025-12-04T10:01:24.7380175Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7380245Z Autotune Choices Stats: 2025-12-04T10:01:24.7381859Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7382146Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7382429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7382797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7384099Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7385425Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7386732Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7388080Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7389466Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7390775Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7391063Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.7391131Z Autotune Choices Stats: 2025-12-04T10:01:24.7392795Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.7393370Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7393732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7394390Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7395835Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7397434Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7398895Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7400299Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7401638Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7402982Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7404347Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7405675Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.7407066Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7408397Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7408719Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.7408858Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7408931Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7408996Z unimplemented [] 2025-12-04T10:01:24.7409103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7409346Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7410753Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7410822Z graph_break [] 2025-12-04T10:01:24.7410954Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7411022Z Autotune Choices Stats: 2025-12-04T10:01:24.7412641Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7412969Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7413212Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7413574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7414880Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7416230Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7417525Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7418851Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7420181Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7421476Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7421759Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.7421826Z Autotune Choices Stats: 2025-12-04T10:01:24.7423495Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7424050Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7424416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7425103Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7426464Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7427851Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7429269Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7430610Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7431936Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7433301Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7434639Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7436013Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7437351Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7438716Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7439006Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.7439178Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.7439257Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7439324Z unimplemented [] 2025-12-04T10:01:24.7439431Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7439639Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7441043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7441120Z graph_break [] 2025-12-04T10:01:24.7441258Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7441327Z Autotune Choices Stats: 2025-12-04T10:01:24.7442935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7443261Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7443508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7443870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7445218Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7446518Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7447852Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7449174Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7450458Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7451750Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7452087Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.7452157Z Autotune Choices Stats: 2025-12-04T10:01:24.7453833Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7454359Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7454764Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7455632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7456990Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7458415Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7459794Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7461137Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7462473Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7463887Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7465264Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7466609Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7467998Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7469408Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7469703Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.7469844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7469917Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7469987Z unimplemented [] 2025-12-04T10:01:24.7470094Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7470304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7471718Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7471826Z graph_break [] 2025-12-04T10:01:24.7471965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7472033Z Autotune Choices Stats: 2025-12-04T10:01:24.7473643Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.7473929Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7474215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7474592Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7475960Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7477513Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7478934Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7480227Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7481515Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7482844Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7483128Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.7483195Z Autotune Choices Stats: 2025-12-04T10:01:24.7484888Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.7485408Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7485786Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7486447Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7487834Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7489217Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7490564Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7491905Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7493282Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7494653Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7495980Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7497309Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7498716Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7500042Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7500331Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.7500469Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7500539Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7500608Z unimplemented [] 2025-12-04T10:01:24.7500714Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7500924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7502362Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7502438Z graph_break [] 2025-12-04T10:01:24.7502579Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7502647Z Autotune Choices Stats: 2025-12-04T10:01:24.7504299Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7504586Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7504828Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7505189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7506492Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7507859Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7509196Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7510496Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7511789Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7513118Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7513405Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:24.7513476Z Autotune Choices Stats: 2025-12-04T10:01:24.7515170Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.7515710Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7516156Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7517011Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7518480Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7519820Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7521158Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7522525Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7523863Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7525233Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7526576Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7527974Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7529343Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7530665Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7530956Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:24.7531147Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7531216Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7531289Z unimplemented [] 2025-12-04T10:01:24.7531397Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7531609Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7533012Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7533075Z graph_break [] 2025-12-04T10:01:24.7533221Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7533291Z Autotune Choices Stats: 2025-12-04T10:01:24.7534937Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7535224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7535473Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7535836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7537180Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7538494Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7539800Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7541097Z 
triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7542433Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7543753Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7544049Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:24.7544118Z Autotune Choices Stats: 2025-12-04T10:01:24.7545783Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7546341Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7546707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7547568Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7549183Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7550523Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7551873Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7553243Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7554625Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7556123Z triton_flex_attention_backward_735 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7557538Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7558933Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7560279Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7561607Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7561956Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:24.7562092Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7562160Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7562230Z unimplemented [] 2025-12-04T10:01:24.7562335Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7562546Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7563962Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7564097Z graph_break [] 2025-12-04T10:01:24.7564239Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7564308Z Autotune Choices Stats: 2025-12-04T10:01:24.7565933Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.7566268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7566512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7566872Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7568213Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7569495Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7570786Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7572122Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7573412Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7574736Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7575024Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:24.7575090Z Autotune Choices Stats: 2025-12-04T10:01:24.7576755Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.7577312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7577711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7578374Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7579728Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7581059Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7582442Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7583807Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7585140Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7586475Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7587925Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7589277Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7590622Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7591990Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7592277Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:24.7592411Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7592483Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7592559Z unimplemented [] 2025-12-04T10:01:24.7592664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7592871Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7594331Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7594396Z graph_break [] 2025-12-04T10:01:24.7594540Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7594609Z Autotune Choices Stats: 2025-12-04T10:01:24.7596231Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:24.7596555Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7596798Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7597197Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7598494Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7599783Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7601125Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7602424Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7603756Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7605051Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7605338Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:24.7605445Z Autotune Choices Stats: 2025-12-04T10:01:24.7607154Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.7607676Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7608052Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7608705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7610062Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7611438Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7612788Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7614161Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7615505Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7616891Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7618277Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7619617Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7620952Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7622322Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7622611Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:24.7622811Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:24.7622891Z Traceback (most recent call last): 2025-12-04T10:01:24.7623284Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:24.7623352Z self.assertTrue( 2025-12-04T10:01:24.7623578Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:24.7623666Z raise self.failureException(msg) 2025-12-04T10:01:24.7623944Z AssertionError: False is not true : Log file /tmp/tmpfozg11dp/flex_attention_configs.json was not created 2025-12-04T10:01:24.7623949Z 2025-12-04T10:01:24.7624099Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:24.7624386Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:24.7624391Z 2025-12-04T10:01:24.7624568Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:24.7624748Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7624821Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7624889Z unimplemented [] 2025-12-04T10:01:24.7625004Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7626468Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:24.7626683Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7626746Z graph_break [] 2025-12-04T10:01:24.7626883Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7628119Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.7628212Z current_size = base.storage().size() 2025-12-04T10:01:24.7628282Z Autotune Choices Stats: 2025-12-04T10:01:24.7629883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.7630218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7630461Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7630828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T10:01:24.7632154Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7633440Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7634727Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7636109Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7637403Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7638686Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7638977Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.7639080Z Autotune Choices Stats: 2025-12-04T10:01:24.7640725Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, 
"best_triton_pos": 0} 2025-12-04T10:01:24.7641249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7641606Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7642315Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7643655Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7644986Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7646378Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7647701Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7649032Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7650394Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7651719Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7653079Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7654402Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7655959Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7656321Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.7656472Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7656546Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7656611Z unimplemented [] 2025-12-04T10:01:24.7656727Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7656934Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7658329Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7658394Z graph_break [] 2025-12-04T10:01:24.7658527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7658658Z Autotune Choices Stats: 2025-12-04T10:01:24.7660267Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7660560Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7660813Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7661177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7662526Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7663812Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7665142Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7666465Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7667823Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7669129Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7669462Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.7669532Z 
Autotune Choices Stats: 2025-12-04T10:01:24.7671190Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7671754Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7672122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7672778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7674123Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.7675546Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7676876Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7678207Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7679573Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7680903Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7682269Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7683600Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7684960Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7686317Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7686608Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.7686751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7686821Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7686886Z unimplemented [] 2025-12-04T10:01:24.7687001Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7687207Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7688603Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7688706Z graph_break [] 2025-12-04T10:01:24.7688844Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7688916Z Autotune Choices Stats: 2025-12-04T10:01:24.7690524Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7690817Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7691093Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7691463Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7692761Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7694092Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7695409Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7696700Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7697975Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7699302Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7699581Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:24.7699651Z Autotune Choices Stats: 2025-12-04T10:01:24.7701337Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7701864Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7702232Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7702880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7704259Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7705627Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7706962Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7708340Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7709711Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7711082Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7712406Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.7713749Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7715131Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7716453Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7716732Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.7716871Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7716942Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7717005Z unimplemented [] 
2025-12-04T10:01:24.7717115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7717357Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7718760Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7718821Z graph_break [] 2025-12-04T10:01:24.7718954Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7719028Z Autotune Choices Stats: 2025-12-04T10:01:24.7720671Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7720965Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7721209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7721570Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7722867Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7724223Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7725522Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7726808Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7728111Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7729410Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7729690Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.7729766Z Autotune Choices Stats: 2025-12-04T10:01:24.7731440Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7731981Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7732349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7733040Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7734407Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7735793Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7737372Z 
triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7738744Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7740081Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7741440Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7742772Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7744178Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7745505Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7747105Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7747499Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.7747737Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7747824Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7747898Z unimplemented [] 2025-12-04T10:01:24.7748017Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7748223Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7749622Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7749686Z graph_break [] 2025-12-04T10:01:24.7749818Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7749897Z Autotune Choices Stats: 2025-12-04T10:01:24.7751533Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7751823Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7752062Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7752467Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7753760Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7755083Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7756627Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7757926Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7759278Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7760622Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7760906Z SingleProcess AUTOTUNE benchmarking takes 0.2921 
seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.7760981Z Autotune Choices Stats: 2025-12-04T10:01:24.7762629Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7763207Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7763567Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7764269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7765631Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7767238Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7768627Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7769954Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7771323Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7772641Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7774005Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7775360Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7776689Z triton_flex_attention_backward_94 0.0142 ms 
79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7778018Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7778332Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.7778473Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7778543Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7778606Z unimplemented [] 2025-12-04T10:01:24.7778717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7778919Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7780351Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7780422Z graph_break [] 2025-12-04T10:01:24.7780557Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7780629Z Autotune Choices Stats: 2025-12-04T10:01:24.7782219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7782545Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7782784Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7783147Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7784492Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7785779Z 
triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7787056Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7788464Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7789736Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7791070Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7791355Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.7791432Z Autotune Choices Stats: 2025-12-04T10:01:24.7793074Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7793643Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7794042Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7794697Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7796032Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7797374Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7798748Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7800116Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7801448Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7802773Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7804171Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7805501Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7806849Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7808222Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7808503Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.7808644Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7808713Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:24.7808779Z unimplemented [] 2025-12-04T10:01:24.7808888Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7809095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7810521Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7810592Z graph_break [] 2025-12-04T10:01:24.7810728Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7810803Z Autotune Choices Stats: 2025-12-04T10:01:24.7812401Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7812727Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7813015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7813378Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7814681Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7815969Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7817298Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7818600Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7819919Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7821213Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7821543Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.7821617Z Autotune Choices Stats: 2025-12-04T10:01:24.7823299Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.7823823Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7824195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7824841Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7826200Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7827626Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7828997Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7830355Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7831689Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7833087Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7834419Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7835743Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7837119Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7838444Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7838724Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.7838894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7838970Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7839034Z unimplemented [] 2025-12-04T10:01:24.7839142Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7839343Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7840735Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7840842Z graph_break [] 2025-12-04T10:01:24.7840974Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7841048Z Autotune Choices Stats: 2025-12-04T10:01:24.7842679Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.7842972Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7843225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7843583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7844875Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7846175Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7847497Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7848820Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7850106Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7851400Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7851712Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.7851785Z Autotune Choices Stats: 2025-12-04T10:01:24.7853464Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7853991Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7854355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7855013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7856709Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7858050Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7859471Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7860803Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7862204Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7863583Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7864919Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7866249Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.7867702Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7869075Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7869361Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.7869498Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7869574Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7869638Z unimplemented [] 2025-12-04T10:01:24.7869751Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7869956Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7871349Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7871457Z graph_break [] 2025-12-04T10:01:24.7871591Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7871669Z Autotune Choices Stats: 2025-12-04T10:01:24.7873314Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.7873607Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7873847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7874205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7875504Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7876825Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7878139Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7879425Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7880714Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7882077Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7882360Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.7882440Z Autotune Choices Stats: 2025-12-04T10:01:24.7884080Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7884608Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7884972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7885828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7887177Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7888548Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7889875Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.7891203Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7892608Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7893939Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7895279Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7896662Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7898011Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7899387Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7899670Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.7899807Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7899887Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7899985Z unimplemented [] 2025-12-04T10:01:24.7900091Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7900304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7901721Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7901791Z graph_break [] 2025-12-04T10:01:24.7901923Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7901994Z Autotune Choices Stats: 2025-12-04T10:01:24.7903592Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7903885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7904129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7904518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7905823Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7907102Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7908484Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7909777Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7911095Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7912512Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7912816Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.7912890Z Autotune Choices Stats: 2025-12-04T10:01:24.7914544Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7915100Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7915463Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7916126Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7917502Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7918846Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7920175Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7921578Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7922909Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7924237Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7925605Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7926932Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7928299Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7929634Z 
triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7929951Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:24.7930086Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7930161Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7930225Z unimplemented [] 2025-12-04T10:01:24.7930329Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7930533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7931976Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7932051Z graph_break [] 2025-12-04T10:01:24.7932183Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7932269Z Autotune Choices Stats: 2025-12-04T10:01:24.7933873Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.7934206Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7934449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7934804Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7936102Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7937414Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.7938710Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7940042Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7941357Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7942656Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7942937Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.7943010Z Autotune Choices Stats: 2025-12-04T10:01:24.7944658Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7945217Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7945583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7946247Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7947677Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7949019Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7950383Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7951743Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7953073Z 
triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7954396Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7956022Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7957428Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7958761Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7960086Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7960424Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.7960564Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7960694Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7960762Z unimplemented [] 2025-12-04T10:01:24.7960868Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7961083Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7962474Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7962541Z graph_break [] 2025-12-04T10:01:24.7962675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7962743Z Autotune Choices Stats: 2025-12-04T10:01:24.7964362Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.7964706Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7964945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7965305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7966649Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7967941Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7969252Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.7970623Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.7971912Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7973195Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7973511Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.7973585Z Autotune Choices Stats: 2025-12-04T10:01:24.7975231Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.7975763Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7976169Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7976828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7978158Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7979537Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7980896Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7982236Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7983567Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7984939Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.7986310Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.7987722Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.7989042Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.7990442Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7990729Z SingleProcess AUTOTUNE 
benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.7990868Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.7990944Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.7991009Z unimplemented [] 2025-12-04T10:01:24.7991115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.7991324Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.7992712Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.7992816Z graph_break [] 2025-12-04T10:01:24.7992950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.7993020Z Autotune Choices Stats: 2025-12-04T10:01:24.7994624Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.7994908Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.7995153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.7995552Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.7996851Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7998142Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.7999494Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8000796Z 
triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8002085Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8003370Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8003689Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.8003762Z Autotune Choices Stats: 2025-12-04T10:01:24.8005413Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8005995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8006363Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8007015Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8008359Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8009758Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8011089Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8012414Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8013774Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8015108Z triton_flex_attention_backward_240 0.0142 
ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8016471Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8017788Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8019159Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8020525Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8020815Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.8020952Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8021038Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8021103Z unimplemented [] 2025-12-04T10:01:24.8021210Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8021422Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8022815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8022923Z graph_break [] 2025-12-04T10:01:24.8023056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8023122Z Autotune Choices Stats: 2025-12-04T10:01:24.8024727Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8025047Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8025292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8025645Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8026946Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8028326Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8029661Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8030962Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8032251Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8033581Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8033868Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.8033945Z Autotune Choices Stats: 2025-12-04T10:01:24.8035623Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8036151Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8036523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8037205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8038570Z triton_flex_attention_backward_254 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8039912Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8041243Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8042625Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8043956Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8045338Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8046687Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8048046Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8049408Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8050740Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8051023Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.8051157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8051230Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8051328Z unimplemented [] 2025-12-04T10:01:24.8051433Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8051653Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8053057Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8053125Z graph_break [] 2025-12-04T10:01:24.8053260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8053326Z Autotune Choices Stats: 2025-12-04T10:01:24.8054971Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8055494Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8055773Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8056125Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8057429Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8058830Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8060125Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8061412Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8062760Z triton_flex_attention_269 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8064052Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8064388Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.8064464Z Autotune Choices Stats: 2025-12-04T10:01:24.8066114Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8066642Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8067043Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8067748Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8069133Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8070472Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8071795Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8073165Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8074522Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8075855Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8077192Z triton_flex_attention_backward_278 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8078615Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8079952Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8081285Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8081607Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.8081753Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8081835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8081898Z unimplemented [] 2025-12-04T10:01:24.8082003Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8082211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8083601Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8083670Z graph_break [] 2025-12-04T10:01:24.8083840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8083909Z Autotune Choices Stats: 2025-12-04T10:01:24.8085519Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8085801Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8086089Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8086446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8087792Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8089078Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8090364Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8091675Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8092963Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8094297Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8094575Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.8094642Z Autotune Choices Stats: 2025-12-04T10:01:24.8096306Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8096876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8097274Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8101706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8103124Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8104479Z triton_flex_attention_backward_291 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8105879Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8107301Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8108676Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8110006Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8111400Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8112722Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8114052Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8115373Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8115712Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.8115860Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8115943Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8116009Z unimplemented [] 2025-12-04T10:01:24.8116134Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8116347Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8117784Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8117862Z graph_break [] 2025-12-04T10:01:24.8118009Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8118082Z Autotune Choices Stats: 2025-12-04T10:01:24.8119712Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8120066Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8120322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8120680Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8122020Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8123304Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8124593Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8125914Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8127227Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8128515Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8128800Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.8128872Z Autotune Choices Stats: 2025-12-04T10:01:24.8130566Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8131123Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8131503Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8132149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8133503Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8134867Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8136200Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8137569Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8138891Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8140260Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8141617Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8142937Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8144262Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8145619Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8145910Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.8146052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8146130Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8146195Z unimplemented [] 2025-12-04T10:01:24.8146303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8146561Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8148047Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8148117Z graph_break [] 2025-12-04T10:01:24.8148254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8148325Z Autotune Choices Stats: 2025-12-04T10:01:24.8150005Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8150331Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8150579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8150936Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8152240Z 
triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8153513Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8154833Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8156454Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8157849Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8159145Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8159486Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.8159557Z Autotune Choices Stats: 2025-12-04T10:01:24.8161266Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:24.8161794Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8162174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8162823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8164178Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8165586Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8166963Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8168291Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8169621Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8171017Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8172344Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8173676Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8175039Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8176371Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8176695Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.8176836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8176909Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8176984Z unimplemented [] 2025-12-04T10:01:24.8177103Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8177319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8178710Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8178820Z graph_break [] 2025-12-04T10:01:24.8178962Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8179032Z Autotune Choices Stats: 2025-12-04T10:01:24.8180684Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8180975Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8181225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8181582Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8182895Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8184230Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8185524Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8186858Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8188203Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8189526Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8189806Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.8189910Z 
Autotune Choices Stats: 2025-12-04T10:01:24.8191563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8192087Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8192457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8193160Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8194510Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:24.8195873Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8197210Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8198554Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8199958Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8201280Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8202616Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8203978Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8205314Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8206672Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8206961Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.8207097Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8207169Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8207238Z unimplemented [] 2025-12-04T10:01:24.8207343Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8207559Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8208990Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8209057Z graph_break [] 2025-12-04T10:01:24.8209194Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8209297Z Autotune Choices Stats: 2025-12-04T10:01:24.8210905Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8211194Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8211435Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8211793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8213129Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8214413Z triton_flex_attention_366 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8215734Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8217022Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8218313Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:24.8219658Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8219944Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.8220023Z Autotune Choices Stats: 2025-12-04T10:01:24.8221684Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8222198Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8222603Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8223247Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8224600Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8225971Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8227385Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8228757Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8230139Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8231477Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8232803Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.8234166Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8235532Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8236865Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8237152Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.8237326Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8237398Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8237468Z 
unimplemented [] 2025-12-04T10:01:24.8237575Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8237784Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8239211Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8239281Z graph_break [] 2025-12-04T10:01:24.8239424Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8239493Z Autotune Choices Stats: 2025-12-04T10:01:24.8241096Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8241378Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8241657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8242015Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8243312Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8244628Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8245916Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8247210Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8248565Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8249844Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8250131Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.8250200Z Autotune Choices Stats: 2025-12-04T10:01:24.8251854Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8252410Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8252779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8253426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8254817Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8256458Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.8257901Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8259302Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8260628Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8261961Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8263351Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8264741Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8266080Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8267446Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8267808Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.8267948Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8268017Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8268086Z unimplemented [] 2025-12-04T10:01:24.8268198Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8268440Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8269839Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8269907Z graph_break [] 2025-12-04T10:01:24.8270047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8270114Z Autotune Choices Stats: 2025-12-04T10:01:24.8271726Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8272051Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8272293Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8272653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8273955Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8275279Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8276559Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8277893Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8279211Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8280496Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8280781Z 
SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.8280850Z Autotune Choices Stats: 2025-12-04T10:01:24.8282547Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.8283065Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8283431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8284121Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8285476Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8286805Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8288208Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8289542Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8290867Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8292250Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8293581Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8294953Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.8296281Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8297648Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8297967Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.8298104Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8298175Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8298244Z unimplemented [] 2025-12-04T10:01:24.8298351Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8298557Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8299956Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8300019Z graph_break [] 2025-12-04T10:01:24.8300160Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8300229Z Autotune Choices Stats: 2025-12-04T10:01:24.8301874Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8302162Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8302405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8302772Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8304126Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8305420Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8306752Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8308118Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8309420Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8310704Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8311025Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.8311093Z Autotune Choices Stats: 2025-12-04T10:01:24.8312755Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8313273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8313678Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8314323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8315677Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8317091Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8318429Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8319757Z 
triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8321089Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8322456Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8323812Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8325144Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8326467Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8327875Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8328169Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.8328306Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.8328382Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8328465Z unimplemented [] 2025-12-04T10:01:24.8328572Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8328773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8330167Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8330270Z graph_break [] 2025-12-04T10:01:24.8330413Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8330482Z Autotune Choices Stats: 2025-12-04T10:01:24.8332085Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8332373Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8332653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8333012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8334307Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8335596Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8336950Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8338240Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8339523Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8340853Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8341143Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.8341209Z Autotune Choices Stats: 2025-12-04T10:01:24.8342901Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8343423Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8343794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8344446Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8345826Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8347189Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8348619Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8349943Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8351315Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8352682Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8354014Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8355655Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8357120Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8358455Z 
triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8358747Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.8358888Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8358960Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8359028Z unimplemented [] 2025-12-04T10:01:24.8359136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8359340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8360799Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8360864Z graph_break [] 2025-12-04T10:01:24.8361003Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8361072Z Autotune Choices Stats: 2025-12-04T10:01:24.8362748Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8363043Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8363286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8363643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8364943Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8366302Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.8367601Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8368891Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8370172Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8371504Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8371789Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.8371861Z Autotune Choices Stats: 2025-12-04T10:01:24.8373546Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8374063Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8374434Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8375109Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8376518Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8377859Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8379209Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8380571Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8381903Z 
triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8383262Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8384593Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8385955Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8387393Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8388746Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8389033Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.8389205Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8389280Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8389349Z unimplemented [] 2025-12-04T10:01:24.8389456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8389662Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8391052Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8391117Z graph_break [] 2025-12-04T10:01:24.8391259Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8391337Z Autotune Choices Stats: 2025-12-04T10:01:24.8392988Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8393273Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8393512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8393917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8395208Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8396541Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8397841Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8399127Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8400686Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8402015Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8402304Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.8402372Z Autotune Choices Stats: 2025-12-04T10:01:24.8404035Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.8404599Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8404968Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8405644Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8407003Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8408345Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8409719Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8411044Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8412403Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8413731Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8415105Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8416488Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8417824Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8419162Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8419480Z SingleProcess AUTOTUNE 
benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.8419619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8419689Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8419757Z unimplemented [] 2025-12-04T10:01:24.8419862Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8420067Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8421510Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8421579Z graph_break [] 2025-12-04T10:01:24.8421717Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8421785Z Autotune Choices Stats: 2025-12-04T10:01:24.8423397Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8423723Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8423968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8424329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8425660Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8426956Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8428301Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8429622Z 
triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8430914Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8432234Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8432517Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.8432586Z Autotune Choices Stats: 2025-12-04T10:01:24.8434235Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8434798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8435201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8435849Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8437189Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8438519Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8439894Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8441264Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8442601Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8443925Z triton_flex_attention_backward_504 0.0153 
ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8445323Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8446649Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8447978Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8449351Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8449626Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.8449764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8449833Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8449903Z unimplemented [] 2025-12-04T10:01:24.8450008Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8450211Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8451658Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8451722Z graph_break [] 2025-12-04T10:01:24.8451860Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8451930Z Autotune Choices Stats: 2025-12-04T10:01:24.8453535Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8453863Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8454103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8454501Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8456027Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8457318Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8458684Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8459974Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8461312Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8462606Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8462940Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.8463009Z Autotune Choices Stats: 2025-12-04T10:01:24.8464705Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.8465227Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8465600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8466246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8467669Z triton_flex_attention_backward_520 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8469049Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8470415Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8471742Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8473075Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8474486Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8475814Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8477143Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8478467Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8479838Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8480115Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.8480264Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8480369Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8480440Z unimplemented [] 2025-12-04T10:01:24.8480556Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8480762Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8482150Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8482254Z graph_break [] 2025-12-04T10:01:24.8482393Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8482461Z Autotune Choices Stats: 2025-12-04T10:01:24.8484096Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8484385Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8484628Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8484987Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8486283Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8487556Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8488893Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8490206Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8491488Z triton_flex_attention_535 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8492770Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8493099Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.8493168Z Autotune Choices Stats: 2025-12-04T10:01:24.8494856Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.8495376Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8495742Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8496385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8497762Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8499087Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8500458Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8501786Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8503154Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8504518Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8505848Z triton_flex_attention_backward_544 0.0155 ms 86.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8507189Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8508598Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8509969Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8510252Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.8510392Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8510460Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8510528Z unimplemented [] 2025-12-04T10:01:24.8510632Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8510836Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8512233Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8512337Z graph_break [] 2025-12-04T10:01:24.8512473Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8512540Z Autotune Choices Stats: 2025-12-04T10:01:24.8514173Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8514468Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8514708Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8515071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8516372Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8517699Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8519030Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8520314Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8521595Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8522914Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8523250Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.8523320Z Autotune Choices Stats: 2025-12-04T10:01:24.8524982Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.8525497Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8525871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8526571Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8527919Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8529296Z triton_flex_attention_backward_558 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8530643Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8531977Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8533377Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8534709Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8536032Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8537412Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8538741Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8540106Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8540384Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.8540526Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8540597Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8540700Z unimplemented [] 2025-12-04T10:01:24.8540809Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8541014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8542413Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8542511Z graph_break [] 2025-12-04T10:01:24.8542652Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8542723Z Autotune Choices Stats: 2025-12-04T10:01:24.8544321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8544609Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8544856Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8545256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8546559Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8547895Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8549225Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8550510Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8551834Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8553142Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8553429Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.8553495Z Autotune Choices Stats: 2025-12-04T10:01:24.8555148Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.8555968Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8556349Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8556997Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8558428Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8559769Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8561099Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8562546Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8563875Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8565206Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8566613Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8567947Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8569306Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8570639Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8570961Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.8571099Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8571167Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8571229Z unimplemented [] 2025-12-04T10:01:24.8571339Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8571541Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8572974Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8573040Z graph_break [] 2025-12-04T10:01:24.8573180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8573246Z Autotune Choices Stats: 2025-12-04T10:01:24.8574850Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8575186Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8575429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8575788Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8577092Z 
triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8578416Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8579707Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8580996Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8582351Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8583646Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8583926Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.8583992Z Autotune Choices Stats: 2025-12-04T10:01:24.8585643Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.8586200Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8586563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8587287Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8588676Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8590013Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8591386Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8592746Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8594075Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8595396Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8596767Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8598136Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8599477Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8600804Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8601117Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.8601257Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8601327Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8601425Z unimplemented [] 2025-12-04T10:01:24.8601538Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8601744Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8603144Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8603206Z graph_break [] 2025-12-04T10:01:24.8603338Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8603410Z Autotune Choices Stats: 2025-12-04T10:01:24.8605013Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8605341Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8605583Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8605965Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8607299Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8608589Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8609869Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8611223Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8612507Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8613795Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8614114Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 
2025-12-04T10:01:24.8614183Z Autotune Choices Stats: 2025-12-04T10:01:24.8615834Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8616378Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8616780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8617433Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8618779Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8620152Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8621521Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8622849Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8624167Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8625521Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8626895Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8628301Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8629627Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8631031Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8631310Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.8631451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8631521Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8631585Z unimplemented [] 2025-12-04T10:01:24.8631696Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8631895Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8633290Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8633409Z graph_break [] 2025-12-04T10:01:24.8633542Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8633617Z Autotune Choices Stats: 2025-12-04T10:01:24.8635219Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8635513Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8635754Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8636152Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8637440Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8638729Z triton_flex_attention_632 0.0113 
ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8640045Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8641366Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8642659Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8643950Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8644272Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.8644339Z Autotune Choices Stats: 2025-12-04T10:01:24.8645987Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.8646534Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8646898Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8647563Z dtypes: torch.float16, 
torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8648903Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8650307Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8651643Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8652971Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8654336Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8655944Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8657342Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8658674Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8660063Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8661434Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8661730Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.8661874Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8661943Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:24.8662006Z unimplemented [] 2025-12-04T10:01:24.8662115Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8662316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8663712Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8663831Z graph_break [] 2025-12-04T10:01:24.8663965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8664039Z Autotune Choices Stats: 2025-12-04T10:01:24.8665642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8665970Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8666213Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8666574Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8667960Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8669297Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8670629Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8671921Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8673208Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8674544Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8674830Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.8674905Z Autotune Choices Stats: 2025-12-04T10:01:24.8676591Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8677122Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8677484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8678170Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8679547Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8680888Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8682216Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8683540Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8684912Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8686276Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8687612Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8688974Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8690340Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8691676Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8691956Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.8692098Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8692169Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8692270Z unimplemented [] 2025-12-04T10:01:24.8692381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8692588Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8694004Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8694069Z graph_break [] 2025-12-04T10:01:24.8694204Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8694287Z Autotune Choices Stats: 2025-12-04T10:01:24.8695928Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8696219Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8696459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8696828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8698120Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8699485Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8700778Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, 
BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8702072Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8703399Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8704685Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.8704966Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.8705073Z Autotune Choices Stats: 2025-12-04T10:01:24.8706734Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8707308Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8707728Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8708391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8709768Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8711111Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8712445Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8713809Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8715176Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8716633Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8718216Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8719651Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.8720979Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8722312Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8722628Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:24.8722772Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8722841Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8722904Z unimplemented [] 2025-12-04T10:01:24.8723014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8723214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8724609Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8724673Z graph_break [] 2025-12-04T10:01:24.8724808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8724921Z Autotune Choices Stats: 2025-12-04T10:01:24.8726721Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:24.8727065Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8727396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8727784Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8729106Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8730393Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8731671Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8732996Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8734281Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8735605Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8735881Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:24.8735963Z Autotune Choices Stats: 2025-12-04T10:01:24.8737922Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.8738526Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8738883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 
1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8739564Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8740896Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8742230Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8743600Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.8744928Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8746314Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8747698Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8749079Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8750439Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8751772Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8753103Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8753431Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:24.8753573Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8753642Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8753704Z unimplemented [] 2025-12-04T10:01:24.8753818Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8754020Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8755731Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8755811Z graph_break [] 2025-12-04T10:01:24.8755955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8756028Z Autotune Choices Stats: 2025-12-04T10:01:24.8757631Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8757975Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8758211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8758573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8759925Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8761215Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8762508Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8763871Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8765180Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8766484Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8766761Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:24.8766836Z Autotune Choices Stats: 2025-12-04T10:01:24.8768478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.8769079Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8769442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8770090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8771427Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8772794Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8774127Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8775490Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8776834Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8778220Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8779580Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8780906Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8782234Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8783624Z 
triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8783906Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:24.8784044Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8784114Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8784179Z unimplemented [] 2025-12-04T10:01:24.8784287Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8784523Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8785914Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8785983Z graph_break [] 2025-12-04T10:01:24.8786115Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8786189Z Autotune Choices Stats: 2025-12-04T10:01:24.8787914Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8788205Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8788480Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8788842Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8790136Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8791417Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.8792743Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8794026Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8795355Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8796655Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8799199Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:24.8799282Z Autotune Choices Stats: 2025-12-04T10:01:24.8801007Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8801542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8801927Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8802567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8803924Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8805254Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8806616Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8807946Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8809270Z 
triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8810734Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8812060Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8813381Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8814699Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8816029Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8816357Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:24.8816504Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8816582Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8816644Z unimplemented [] 2025-12-04T10:01:24.8816752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8816962Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8818356Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8818470Z graph_break [] 2025-12-04T10:01:24.8818684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8818751Z Autotune Choices Stats: 2025-12-04T10:01:24.8820390Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.8820685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8820933Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8821288Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8822589Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8823869Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8825151Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8826472Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8827864Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8829195Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8829518Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:24.8829586Z Autotune Choices Stats: 2025-12-04T10:01:24.8831274Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.8831799Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8832170Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8832807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8834144Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8835507Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8836832Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8838162Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8839573Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8840898Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8842210Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8843528Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8844848Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8846202Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8846495Z SingleProcess AUTOTUNE 
benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:24.8846634Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8846706Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8846774Z unimplemented [] 2025-12-04T10:01:24.8846881Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8847092Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8848521Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8848627Z graph_break [] 2025-12-04T10:01:24.8848759Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8848825Z Autotune Choices Stats: 2025-12-04T10:01:24.8850447Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:24.8850737Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8850982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8851341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8852630Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8853911Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8855489Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8856880Z 
triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8858193Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8859669Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8859956Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:24.8860028Z Autotune Choices Stats: 2025-12-04T10:01:24.8861690Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8862212Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8862581Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8863219Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8864580Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8866016Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8867682Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8869080Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8870468Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8871802Z triton_flex_attention_backward_772 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8873129Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8874465Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8875831Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8877159Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.8877449Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:24.8877627Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8877699Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8877771Z unimplemented [] 2025-12-04T10:01:24.8877922Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8878136Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8879580Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8879650Z graph_break [] 2025-12-04T10:01:24.8879790Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8879860Z Autotune Choices Stats: 2025-12-04T10:01:24.8881483Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8881775Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8882021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8882381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8883692Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8885020Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8886315Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8887594Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8888998Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8890283Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8890572Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:24.8890639Z Autotune Choices Stats: 2025-12-04T10:01:24.8892292Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8892814Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8893192Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8893837Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8895241Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8896572Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8897909Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8899348Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8900666Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8902011Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8903328Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8904795Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4
2025-12-04T10:01:24.8906175Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4
2025-12-04T10:01:24.8907564Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4
2025-12-04T10:01:24.8907926Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices
2025-12-04T10:01:24.8908119Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:24.8908198Z Traceback (most recent call last):
2025-12-04T10:01:24.8908557Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:24.8908622Z self.assertTrue(
2025-12-04T10:01:24.8909134Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:24.8909222Z raise self.failureException(msg)
2025-12-04T10:01:24.8909499Z AssertionError: False is not true : Log file /tmp/tmpokaaz2b9/flex_attention_configs.json was not created
2025-12-04T10:01:24.8909506Z
2025-12-04T10:01:24.8909655Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:24.8909951Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:24.8909957Z
2025-12-04T10:01:24.8910136Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:24.8910276Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:24.8910357Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:24.8910425Z unimplemented []
2025-12-04T10:01:24.8910535Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:24.8911943Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:24.8912160Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)]
2025-12-04T10:01:24.8912221Z graph_break []
2025-12-04T10:01:24.8912359Z ----------------------------- Captured stderr call -----------------------------
2025-12-04T10:01:24.8913520Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class.
This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:24.8913607Z current_size = base.storage().size() 2025-12-04T10:01:24.8913674Z Autotune Choices Stats: 2025-12-04T10:01:24.8915321Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:24.8915621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8915858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8916223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8917552Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8918904Z triton_flex_attention_5 0.0102 ms 
91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8920191Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8921468Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8922744Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.8924047Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8924341Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:24.8924409Z Autotune Choices Stats: 2025-12-04T10:01:24.8926068Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.8926627Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8927033Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8927725Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8929065Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8930383Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8931700Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8933011Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8934366Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8935688Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8937004Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:24.8938428Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8939751Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8941074Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8941354Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:24.8941499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8941569Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8941631Z unimplemented [] 
2025-12-04T10:01:24.8941741Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8941946Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8943346Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8943445Z graph_break [] 2025-12-04T10:01:24.8943595Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8943663Z Autotune Choices Stats: 2025-12-04T10:01:24.8945262Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8945590Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8945830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8946228Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8947662Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8948964Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8950250Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8951541Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8952821Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8954141Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8954436Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:24.8954505Z Autotune Choices Stats: 2025-12-04T10:01:24.8956320Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8956976Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8957396Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8958042Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8959406Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8960752Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8962078Z 
triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8963460Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8964795Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8966136Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8967598Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8968923Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8970247Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8971581Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8971874Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:24.8972022Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.8972097Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.8972167Z unimplemented [] 2025-12-04T10:01:24.8972276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.8972485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.8973931Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.8974000Z graph_break [] 2025-12-04T10:01:24.8974144Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.8974211Z Autotune Choices Stats: 2025-12-04T10:01:24.8975827Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.8976186Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8976429Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8976826Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8978124Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8979403Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8980681Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.8981975Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.8983294Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8984581Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8984870Z SingleProcess AUTOTUNE benchmarking takes 0.2908 
seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:24.8984976Z Autotune Choices Stats: 2025-12-04T10:01:24.8986631Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.8987276Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.8987652Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.8988301Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.8989653Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8990990Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8992321Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.8993682Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8995010Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.8996364Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.8997761Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.8999088Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9000414Z triton_flex_attention_backward_56 0.0142 ms 
79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9001739Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9002019Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:24.9002160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9002230Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9002294Z unimplemented [] 2025-12-04T10:01:24.9002443Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9002650Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9004040Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9004101Z graph_break [] 2025-12-04T10:01:24.9004249Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9004375Z Autotune Choices Stats: 2025-12-04T10:01:24.9005985Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9006348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9006591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9006958Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9008255Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9009536Z 
triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9010834Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9012156Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9013440Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9014717Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9015080Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:24.9015149Z Autotune Choices Stats: 2025-12-04T10:01:24.9016835Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9017357Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9017730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9018376Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9019717Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9021058Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9022430Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9023762Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9025117Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9026525Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9027908Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9029245Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9030573Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9031939Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9032228Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:24.9032375Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9032448Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:24.9032513Z unimplemented [] 2025-12-04T10:01:24.9032633Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9032841Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9034252Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9034385Z graph_break [] 2025-12-04T10:01:24.9034523Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9034598Z Autotune Choices Stats: 2025-12-04T10:01:24.9036241Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9036548Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9036794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9037160Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9038452Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9039734Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9041017Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9042354Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9043637Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9044950Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9045303Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:24.9045373Z Autotune Choices Stats: 2025-12-04T10:01:24.9047023Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9047548Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9047911Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9048562Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9049906Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9051289Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:24.9052634Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9053959Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9055696Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9057050Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9058388Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9059722Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9061041Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9062424Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9062712Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:24.9062857Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9062928Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9062990Z unimplemented [] 2025-12-04T10:01:24.9063105Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9063364Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9064758Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9064869Z graph_break [] 2025-12-04T10:01:24.9065045Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9065120Z Autotune Choices Stats: 2025-12-04T10:01:24.9066719Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9067017Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9067307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9067676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9068966Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9070257Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9071571Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9072864Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9074140Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9075540Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9075828Z 
SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:24.9075898Z Autotune Choices Stats: 2025-12-04T10:01:24.9077552Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9078078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9078443Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9079092Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9080443Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9081813Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9083156Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9084553Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9085923Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9087249Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9088588Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9089926Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.9091301Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9092652Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9092939Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:24.9093124Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9093199Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9093300Z unimplemented [] 2025-12-04T10:01:24.9093416Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9093623Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9095049Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9095114Z graph_break [] 2025-12-04T10:01:24.9095248Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9095336Z Autotune Choices Stats: 2025-12-04T10:01:24.9096947Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9097240Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9097482Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9097848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9099152Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9100482Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9101774Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9103066Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9104449Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9105748Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9106041Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:24.9106109Z Autotune Choices Stats: 2025-12-04T10:01:24.9107831Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9108360Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9108725Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9109379Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9110760Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9112096Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9113484Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9114901Z 
triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9116236Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9117565Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9118895Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9120255Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9121587Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9122927Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9123297Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:24.9123439Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.9123508Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9123569Z unimplemented [] 2025-12-04T10:01:24.9123680Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9123929Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9125320Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9125386Z graph_break [] 2025-12-04T10:01:24.9125519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9125592Z Autotune Choices Stats: 2025-12-04T10:01:24.9127195Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.9127487Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9127727Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9128094Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9129416Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9130699Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9131984Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9133345Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9134658Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9135946Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9136237Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:24.9136310Z Autotune Choices Stats: 2025-12-04T10:01:24.9137966Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9138494Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9138855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9139540Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9140882Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9142213Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9143647Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9144976Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9146333Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9147724Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9149060Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9150428Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9151754Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9153140Z 
triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9153493Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:24.9153640Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9153710Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9153774Z unimplemented [] 2025-12-04T10:01:24.9153883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9154089Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9155788Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9155860Z graph_break [] 2025-12-04T10:01:24.9155998Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9156071Z Autotune Choices Stats: 2025-12-04T10:01:24.9157675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:24.9157970Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9158211Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9158575Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9159939Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9161229Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.9162570Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9163948Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9165233Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9166523Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9166814Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.9166891Z Autotune Choices Stats: 2025-12-04T10:01:24.9168537Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9169103Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9169471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9170128Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9171468Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9172900Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9174238Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9175564Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9176895Z 
triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9178216Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9179585Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9180907Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9182264Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9183676Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9183961Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:24.9184106Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9184176Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9184240Z unimplemented [] 2025-12-04T10:01:24.9184354Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9184556Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9185939Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9186016Z graph_break [] 2025-12-04T10:01:24.9186153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9186227Z Autotune Choices Stats: 2025-12-04T10:01:24.9187892Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9188189Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9188490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9188852Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9190142Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9191420Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9192807Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9194089Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9195366Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9196655Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9196936Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:24.9197008Z Autotune Choices Stats: 2025-12-04T10:01:24.9198699Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9199228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9199590Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9200238Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9201607Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9203016Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9204344Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9205678Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9207008Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9208354Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9209687Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9211014Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9212442Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9213769Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9214052Z SingleProcess AUTOTUNE 
benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:24.9214190Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9214257Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9214322Z unimplemented [] 2025-12-04T10:01:24.9214431Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9214634Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9216046Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9216117Z graph_break [] 2025-12-04T10:01:24.9216251Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9216326Z Autotune Choices Stats: 2025-12-04T10:01:24.9217969Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:24.9218262Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9218498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9218856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9220157Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9221559Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9222850Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9224131Z 
triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9225419Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9226718Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9226996Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:24.9227072Z Autotune Choices Stats: 2025-12-04T10:01:24.9228815Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9229343Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9229704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9230419Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9231815Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9233179Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9234524Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9235852Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9237187Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9238551Z triton_flex_attention_backward_200 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9239884Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9241308Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9242654Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9243985Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9244263Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:24.9244398Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9244476Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9244540Z unimplemented [] 2025-12-04T10:01:24.9249987Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9250256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9251675Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9251750Z graph_break [] 2025-12-04T10:01:24.9251899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9251979Z Autotune Choices Stats: 2025-12-04T10:01:24.9253678Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9253992Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9254238Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9254642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9256197Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9257606Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9258913Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9260221Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9261513Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9262864Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9263161Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:24.9263246Z Autotune Choices Stats: 2025-12-04T10:01:24.9264921Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9265514Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9265958Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9266667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9268104Z triton_flex_attention_backward_215 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9269462Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9270804Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9272161Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9273547Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9274889Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9276273Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9277681Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9279022Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9280366Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9280661Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:24.9280808Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9280888Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9280955Z unimplemented [] 2025-12-04T10:01:24.9281073Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9281287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9282757Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9282833Z graph_break [] 2025-12-04T10:01:24.9282973Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9283049Z Autotune Choices Stats: 2025-12-04T10:01:24.9284686Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:24.9285020Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9285307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9285672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9287037Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9288354Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9289664Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9290972Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9292269Z triton_flex_attention_231 
0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9293604Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9293898Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:24.9293975Z Autotune Choices Stats: 2025-12-04T10:01:24.9295648Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9296249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9296650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9297316Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9298673Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9300028Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9301367Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9302746Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9304100Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9305450Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9306926Z triton_flex_attention_backward_238 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9308324Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9309671Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9311030Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9311321Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:24.9311462Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9311540Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9311605Z unimplemented [] 2025-12-04T10:01:24.9311715Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9311933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9313390Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9313468Z graph_break [] 2025-12-04T10:01:24.9313606Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9313683Z Autotune Choices Stats: 2025-12-04T10:01:24.9315303Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9315669Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9315942Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9316306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9317760Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9319073Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9320380Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9321679Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9323011Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9324316Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9324650Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:24.9324735Z Autotune Choices Stats: 2025-12-04T10:01:24.9326470Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9327000Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9327366Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9328027Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9329376Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9330726Z triton_flex_attention_backward_255 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9332097Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9333450Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9334785Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9336222Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9337561Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9338906Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9340247Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9341592Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9341898Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:24.9342101Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9342187Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9342253Z unimplemented [] 2025-12-04T10:01:24.9342363Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9342577Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9343988Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9344095Z graph_break [] 2025-12-04T10:01:24.9344233Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9344342Z Autotune Choices Stats: 2025-12-04T10:01:24.9346005Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9346298Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9346544Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9346903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9348256Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9349543Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9350842Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9352175Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9353464Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9354752Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9355103Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:24.9355178Z Autotune Choices Stats: 2025-12-04T10:01:24.9357223Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9357759Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9358126Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9358775Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9360125Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9361470Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9362859Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9364202Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9365593Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9367022Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9368359Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9369696Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9371029Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9372393Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9372689Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:24.9372830Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9372910Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9372976Z unimplemented [] 2025-12-04T10:01:24.9373089Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9373301Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9374697Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9374839Z graph_break [] 2025-12-04T10:01:24.9374980Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9375050Z Autotune Choices Stats: 2025-12-04T10:01:24.9376723Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9377023Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9377267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9377629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9378943Z 
triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9380229Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9381560Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9382862Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9384151Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9385553Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9385840Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:24.9385916Z Autotune Choices Stats: 2025-12-04T10:01:24.9387650Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9388183Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9388549Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9389207Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9390548Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9391932Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9393265Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9394642Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9396033Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9397375Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.9398723Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9400064Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9401437Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9402763Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9403052Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:24.9403189Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9403270Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9403374Z unimplemented [] 2025-12-04T10:01:24.9403486Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9403704Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9405175Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9405246Z graph_break [] 2025-12-04T10:01:24.9405381Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9405459Z Autotune Choices Stats: 2025-12-04T10:01:24.9407081Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9407372Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9407621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9407978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9409287Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9410591Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9411932Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9413229Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9414554Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9415942Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9416232Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:24.9416306Z Autotune Choices Stats: 
2025-12-04T10:01:24.9417963Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9418490Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9418855Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9419518Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9420912Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:24.9422264Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9423604Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9425246Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9426583Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9427985Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9429317Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9430645Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9432019Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9433348Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9433672Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:24.9433845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9433920Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9433984Z unimplemented [] 2025-12-04T10:01:24.9434102Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9434316Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9435746Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9435819Z graph_break [] 2025-12-04T10:01:24.9435955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9436022Z Autotune Choices Stats: 2025-12-04T10:01:24.9437637Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9437925Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9438177Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9438538Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9439843Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9441160Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9442451Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9443784Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9445160Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:24.9446466Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9446745Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:24.9446817Z Autotune Choices Stats: 2025-12-04T10:01:24.9448471Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9449002Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9449365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9450021Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9451398Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9452759Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9454208Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9455838Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9457187Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9458522Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9459860Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:24.9461265Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9462602Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9463928Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9464316Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:24.9464453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9464573Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9464640Z 
unimplemented [] 2025-12-04T10:01:24.9464748Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9464958Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9466352Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9466421Z graph_break [] 2025-12-04T10:01:24.9466556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9466626Z Autotune Choices Stats: 2025-12-04T10:01:24.9468346Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9468639Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9468885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9469245Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9470594Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9471887Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9473177Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9474561Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9475866Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9477164Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9477444Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:24.9477521Z Autotune Choices Stats: 2025-12-04T10:01:24.9479182Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9479708Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9480121Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9480773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9482122Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9483499Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.9484901Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9486244Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9487576Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9488909Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9490296Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9491629Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9492976Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9494406Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9494694Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:24.9494832Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9494906Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9494970Z unimplemented [] 2025-12-04T10:01:24.9495078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9495287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9496687Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9496764Z graph_break [] 2025-12-04T10:01:24.9496901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9496970Z Autotune Choices Stats: 2025-12-04T10:01:24.9498585Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9498873Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9499118Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9499516Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9500818Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9502103Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9503497Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9504781Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9506083Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9507441Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9507728Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:24.9507797Z Autotune Choices Stats: 2025-12-04T10:01:24.9509450Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9510018Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9510382Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9511033Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9512390Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9513826Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9515157Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9516493Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9517824Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9519156Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9520519Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9521856Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:24.9523260Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9524658Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9524951Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:24.9525085Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9525174Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9525238Z unimplemented [] 2025-12-04T10:01:24.9525348Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9525562Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9526954Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9527028Z graph_break [] 2025-12-04T10:01:24.9527163Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9527231Z Autotune Choices Stats: 2025-12-04T10:01:24.9528848Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9529180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9529433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9529793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9531095Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9532422Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9533775Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9535063Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9536378Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9537665Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9537948Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:24.9538025Z Autotune Choices Stats: 2025-12-04T10:01:24.9539721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9540251Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9540618Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9541304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9542713Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9544061Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9545403Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9546746Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9548134Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9549511Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9550857Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9552229Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9553631Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9554960Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9555462Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:24.9555659Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.9555740Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9555808Z unimplemented [] 2025-12-04T10:01:24.9555921Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9556148Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9557551Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9557621Z graph_break [] 2025-12-04T10:01:24.9557761Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9557830Z Autotune Choices Stats: 2025-12-04T10:01:24.9559522Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9559818Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9560068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9560428Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9561791Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9563198Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9564512Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9565811Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9567103Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9568400Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9568720Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:24.9568796Z Autotune Choices Stats: 2025-12-04T10:01:24.9570454Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:24.9570975Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9571379Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9572061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9573452Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9574803Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9576134Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9577467Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9578829Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9580159Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9581485Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9582909Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9584239Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9585578Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9585866Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:24.9586004Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9586083Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9586146Z unimplemented [] 2025-12-04T10:01:24.9586253Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9586477Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9587920Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9587991Z graph_break [] 2025-12-04T10:01:24.9588170Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9588242Z Autotune Choices Stats: 2025-12-04T10:01:24.9589854Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9590140Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9590428Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9590824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9592166Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9593459Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.9594755Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9596046Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9597340Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9598693Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9598981Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:24.9599051Z Autotune Choices Stats: 2025-12-04T10:01:24.9600705Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:24.9601329Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9601735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9602385Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9603739Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9605071Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9606407Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9607782Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9609113Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9610446Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9611880Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9613221Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9614563Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9615898Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9616190Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:24.9616325Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9616397Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9616467Z unimplemented [] 2025-12-04T10:01:24.9616574Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9616786Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9618217Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9618292Z graph_break [] 2025-12-04T10:01:24.9618427Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9618500Z Autotune Choices Stats: 2025-12-04T10:01:24.9620110Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9620463Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9620714Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9621108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9622410Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9623689Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9624988Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9626287Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9627674Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9628975Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9629254Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:24.9629320Z Autotune Choices Stats: 2025-12-04T10:01:24.9631007Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9631595Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9631967Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9632617Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9633961Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9635300Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9636648Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9638040Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9639368Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9640731Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9642140Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9643476Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9644816Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9646160Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9646451Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:24.9646591Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9646664Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9646736Z unimplemented [] 2025-12-04T10:01:24.9646877Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9647089Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9648477Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9648540Z graph_break [] 2025-12-04T10:01:24.9648683Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9648751Z Autotune Choices Stats: 2025-12-04T10:01:24.9650406Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9650759Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9651004Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9651360Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9652660Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9653937Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9655449Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9656907Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9658206Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9659498Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9659881Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:24.9659950Z Autotune Choices Stats: 2025-12-04T10:01:24.9661655Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9662182Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9662556Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9663211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9664566Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9665913Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9667348Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9668695Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9670024Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9671475Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9672812Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9674139Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9675471Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9676813Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9677159Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:24.9677304Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9677375Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9677447Z unimplemented [] 2025-12-04T10:01:24.9677555Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9677770Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9679162Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9679300Z graph_break [] 2025-12-04T10:01:24.9679441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9679509Z Autotune Choices Stats: 2025-12-04T10:01:24.9681149Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:24.9681443Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9681698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9682060Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9683367Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9684659Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9685970Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9687301Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9688600Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9689925Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9690248Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:24.9690352Z Autotune Choices Stats: 2025-12-04T10:01:24.9692019Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:24.9692547Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9692917Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9693573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9694938Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9696320Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9697663Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9698991Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9700423Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9701760Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9703084Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9704413Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9705747Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9707112Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9707455Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:24.9707597Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9707668Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9707738Z unimplemented [] 2025-12-04T10:01:24.9707845Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9708056Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9709507Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9709624Z graph_break [] 2025-12-04T10:01:24.9709765Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9709866Z Autotune Choices Stats: 2025-12-04T10:01:24.9711479Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.9711772Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9712018Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9712376Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9713680Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9714977Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9716320Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9717608Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9718898Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9720287Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9720575Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:24.9720644Z Autotune Choices Stats: 2025-12-04T10:01:24.9722303Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.9722827Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9723195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9723844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9725198Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9726583Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9727924Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9729294Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9730688Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9732027Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9733357Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9734691Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9736072Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9737399Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9737685Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:24.9737855Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9737927Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9738039Z unimplemented [] 2025-12-04T10:01:24.9738150Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9738357Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9739788Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9739852Z graph_break [] 2025-12-04T10:01:24.9739993Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9740063Z Autotune Choices Stats: 2025-12-04T10:01:24.9741684Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.9741971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9742218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9742577Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9743877Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9745207Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9746506Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9747840Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9749265Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9750555Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9750843Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:24.9750913Z Autotune Choices Stats: 2025-12-04T10:01:24.9752585Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.9753105Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9753475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9754119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9755802Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9757171Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9758580Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9760016Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9761347Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9762684Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9764015Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9765391Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9766728Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9768062Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9768423Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:24.9768561Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9768631Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9768703Z unimplemented [] 2025-12-04T10:01:24.9768808Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9769050Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9770450Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9770516Z graph_break [] 2025-12-04T10:01:24.9770659Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9770726Z Autotune Choices Stats: 2025-12-04T10:01:24.9772341Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.9772626Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9772872Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9773233Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9774527Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9775850Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9777149Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9778494Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9779818Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9781107Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9781391Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:24.9781458Z Autotune Choices Stats: 2025-12-04T10:01:24.9783115Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:24.9783636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9783999Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9784705Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9786060Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9787469Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9788922Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9790257Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9791595Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9792922Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9794253Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9795616Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9796942Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9798306Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9798658Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:24.9798795Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9798875Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9798946Z unimplemented [] 2025-12-04T10:01:24.9799056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9799265Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9800669Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9800732Z graph_break [] 2025-12-04T10:01:24.9800873Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9800940Z Autotune Choices Stats: 2025-12-04T10:01:24.9802547Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9802833Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9803074Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9803438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9804775Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9806076Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9807416Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9808768Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9810061Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9811348Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9811630Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:24.9811699Z Autotune Choices Stats: 2025-12-04T10:01:24.9813358Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:24.9813874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9814285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9814937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9816285Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9817741Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9819078Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9820412Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9821765Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9823127Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9824491Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9825837Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9827165Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9828649Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9828946Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:24.9829086Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9829158Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9829229Z unimplemented [] 2025-12-04T10:01:24.9829337Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9829543Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9830944Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9831011Z graph_break [] 2025-12-04T10:01:24.9831153Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9831226Z Autotune Choices Stats: 2025-12-04T10:01:24.9832841Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9833126Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9833407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9833776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9835084Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9836379Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9837790Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9839081Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9840375Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9841669Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9841954Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:24.9842023Z 
Autotune Choices Stats: 2025-12-04T10:01:24.9843715Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:24.9844234Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9844604Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9845244Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9846623Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.9848034Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9849371Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9850699Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9852034Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9853398Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9854732Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9856272Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9857795Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9859129Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:24.9859421Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:24.9859559Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9859631Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9859704Z unimplemented [] 2025-12-04T10:01:24.9859809Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9860012Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9861421Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9861486Z graph_break [] 2025-12-04T10:01:24.9861628Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9861698Z Autotune Choices Stats: 2025-12-04T10:01:24.9863367Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9863656Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9863899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9864267Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9865565Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9866964Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9868341Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9869628Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9870918Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:24.9872209Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9872494Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:24.9872563Z Autotune Choices Stats: 2025-12-04T10:01:24.9874259Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.9874780Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9875150Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9875864Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9877272Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9878607Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9879947Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9881282Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9882618Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9883988Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9885319Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:24.9886721Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9888084Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9889423Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9889705Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:24.9889846Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9889916Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9889985Z 
unimplemented [] 2025-12-04T10:01:24.9890091Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9890298Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9891698Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9891760Z graph_break [] 2025-12-04T10:01:24.9891901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9891973Z Autotune Choices Stats: 2025-12-04T10:01:24.9893613Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:24.9893904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9894145Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9894564Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9895917Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9897240Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9898543Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9899836Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9901136Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9902451Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9902741Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:24.9902813Z Autotune Choices Stats: 2025-12-04T10:01:24.9904476Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.9905033Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9905435Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9906110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9907513Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9908870Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:24.9910217Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9911544Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9912916Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9914257Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9915623Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9917039Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9918365Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9919704Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9919988Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:24.9920128Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9920197Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9920266Z unimplemented [] 2025-12-04T10:01:24.9920372Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9920574Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9922012Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9922080Z graph_break [] 2025-12-04T10:01:24.9922218Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9922285Z Autotune Choices Stats: 2025-12-04T10:01:24.9923883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9924209Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9924485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9924844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9926171Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9927464Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9928755Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9930050Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9931344Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9932685Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9932976Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:24.9933044Z Autotune Choices Stats: 2025-12-04T10:01:24.9934694Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:24.9935276Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9935913Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9936567Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9937912Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9939245Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9940583Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9941951Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9943293Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9944622Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9946045Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9947425Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:24.9948752Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9950092Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9950376Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:24.9950519Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:24.9950592Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9950661Z unimplemented [] 2025-12-04T10:01:24.9950768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9950973Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9952415Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9952481Z graph_break [] 2025-12-04T10:01:24.9952623Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9952690Z Autotune Choices Stats: 2025-12-04T10:01:24.9954295Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9954655Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9954928Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9955554Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9956881Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9958173Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9959458Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9960758Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9962122Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9963414Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9963757Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:24.9963830Z Autotune Choices Stats: 2025-12-04T10:01:24.9965604Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.9966133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9966510Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9967155Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9968508Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9969844Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9971227Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9972569Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9973907Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9975349Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9976674Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:24.9978014Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:24.9979340Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9980684Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:24.9980964Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:24.9981145Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:24.9981219Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:24.9981283Z unimplemented [] 2025-12-04T10:01:24.9981395Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:24.9981598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:24.9983004Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:24.9983104Z graph_break [] 2025-12-04T10:01:24.9983246Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:24.9983316Z Autotune Choices Stats: 2025-12-04T10:01:24.9984989Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:24.9985280Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9985522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9985892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9987202Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9988568Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9989855Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9991191Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:24.9992488Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:24.9993776Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:24.9994125Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:24.9994193Z Autotune Choices Stats: 2025-12-04T10:01:24.9995871Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:24.9996399Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:24.9996769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:24.9997414Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:24.9998761Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0000092Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0001458Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0002793Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0004160Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0005570Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0006915Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0008250Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0009579Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0010949Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0011232Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.0011373Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0011442Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0011504Z unimplemented [] 2025-12-04T10:01:25.0011617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0011822Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0013221Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0013354Z graph_break [] 2025-12-04T10:01:25.0013493Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0013566Z Autotune Choices Stats: 2025-12-04T10:01:25.0015199Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.0015493Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0015732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0016101Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0017396Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0018695Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.0020017Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0021312Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0022610Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0024002Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0024288Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.0024356Z Autotune Choices Stats: 2025-12-04T10:01:25.0026016Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.0026542Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0026907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0027600Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0028937Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0030326Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0031673Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0033011Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0034444Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0035793Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0037143Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0038471Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0039811Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0041179Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0041465Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.0041606Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0041676Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0041802Z unimplemented [] 2025-12-04T10:01:25.0041913Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0042120Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0043576Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0043639Z graph_break [] 2025-12-04T10:01:25.0043775Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0043848Z Autotune Choices Stats: 2025-12-04T10:01:25.0045463Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0045769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0046014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0046377Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0047678Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0048971Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0050295Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0051589Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0052924Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0054283Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0054574Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.0054643Z Autotune Choices Stats: 2025-12-04T10:01:25.0056559Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.0057097Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0057461Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0058111Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0059518Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0060866Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0062201Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0063713Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0065057Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0066393Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0067792Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0069129Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0070496Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0071826Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0072142Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.0072345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0072417Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0072480Z unimplemented [] 2025-12-04T10:01:25.0072592Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0072798Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0074237Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0074307Z graph_break [] 2025-12-04T10:01:25.0074441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0074515Z Autotune Choices Stats: 2025-12-04T10:01:25.0076123Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.0076414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0076657Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0077025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0078329Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0079673Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0080968Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0082296Z 
triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0083643Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0084940Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0085230Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.0085299Z Autotune Choices Stats: 2025-12-04T10:01:25.0086957Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.0087483Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0087850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0088503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0089872Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0091226Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0092623Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0093980Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0095317Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0096643Z triton_flex_attention_backward_735 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0097989Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0099359Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0100690Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0102029Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0102371Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.0102513Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0102617Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0102683Z unimplemented [] 2025-12-04T10:01:25.0102797Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0103002Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0104398Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0104465Z graph_break [] 2025-12-04T10:01:25.0104600Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0104674Z Autotune Choices Stats: 2025-12-04T10:01:25.0106289Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.0106588Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0106829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0107203Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0108606Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0109903Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0111193Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0112605Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0113894Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0115197Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0115486Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.0115569Z Autotune Choices Stats: 2025-12-04T10:01:25.0117227Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.0117751Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0118152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0118811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0120152Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0121523Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0122929Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0124263Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0125593Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0126919Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0128286Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0129625Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0130954Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0132376Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0132660Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.0132802Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0132885Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0132951Z unimplemented [] 2025-12-04T10:01:25.0133065Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0133271Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0134678Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0134742Z graph_break [] 2025-12-04T10:01:25.0134874Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0134948Z Autotune Choices Stats: 2025-12-04T10:01:25.0136535Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.0136827Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0137062Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0137474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0138760Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0140058Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0141404Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0142725Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0144015Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0145312Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0145592Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.0145678Z Autotune Choices Stats: 2025-12-04T10:01:25.0147369Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.0147934Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0148300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0148956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0150299Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0151775Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0153105Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0154441Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0156044Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0157383Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0158798Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0160147Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0161531Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0162964Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0163251Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.0163396Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0163467Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0163532Z unimplemented [] 2025-12-04T10:01:25.0163644Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0163850Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0165251Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0165317Z graph_break [] 2025-12-04T10:01:25.0165451Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0165526Z Autotune Choices Stats: 2025-12-04T10:01:25.0167146Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.0167491Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0167739Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0168104Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0169411Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0170746Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0172096Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0173397Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0174691Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0175994Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0176277Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.0176354Z Autotune Choices Stats: 2025-12-04T10:01:25.0178041Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.0178566Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0178938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0179631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0181040Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0182385Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0183732Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0185070Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0186408Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.0187853Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0189196Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0190575Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0191978Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0193315Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0193598Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.0193743Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0193815Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0193880Z unimplemented [] 2025-12-04T10:01:25.0193993Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0194203Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0195599Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0195683Z graph_break [] 2025-12-04T10:01:25.0195821Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.0195899Z Autotune Choices Stats: 2025-12-04T10:01:25.0197542Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.0197843Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0198091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0198455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0199753Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0201136Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0202426Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0203717Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0204998Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0206286Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0206568Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.0206677Z Autotune Choices Stats: 2025-12-04T10:01:25.0208333Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.0208855Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0209254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0209943Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.0211330Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0212682Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0214008Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0215346Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0216699Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0218047Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0219374Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0220834Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0222179Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0223517Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0223795Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.0223999Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.0224079Z Traceback (most recent call last): 2025-12-04T10:01:25.0224434Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.0224504Z self.assertTrue( 2025-12-04T10:01:25.0224731Z 
File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.0224814Z raise self.failureException(msg) 2025-12-04T10:01:25.0225096Z AssertionError: False is not true : Log file /tmp/tmpxa3ik349/flex_attention_configs.json was not created 2025-12-04T10:01:25.0225101Z 2025-12-04T10:01:25.0225244Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:25.0225545Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:25.0225550Z 2025-12-04T10:01:25.0225731Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.0225926Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0226007Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0226072Z unimplemented [] 2025-12-04T10:01:25.0226185Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0227639Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.0227846Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0227953Z graph_break [] 2025-12-04T10:01:25.0228098Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0229313Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. 
It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.0229399Z current_size = base.storage().size() 2025-12-04T10:01:25.0229501Z Autotune Choices Stats: 2025-12-04T10:01:25.0231120Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.0231415Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0231666Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0232028Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0233334Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.0234608Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0235928Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0237216Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0238502Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0239902Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0240186Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.0240264Z Autotune Choices Stats: 2025-12-04T10:01:25.0241912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.0242436Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0242804Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], 
[1, 1, 1, 1] 2025-12-04T10:01:25.0243452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0244806Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0246175Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0247500Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0248859Z triton_flex_attention_backward_9 0.0143 ms 92.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0250243Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0251577Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0252906Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0254231Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0255920Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0257273Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0257564Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.0257703Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:25.0257835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0257903Z unimplemented [] 2025-12-04T10:01:25.0258056Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0258263Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0259718Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0259789Z graph_break [] 2025-12-04T10:01:25.0259927Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0259998Z Autotune Choices Stats: 2025-12-04T10:01:25.0261616Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0261904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0262152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0262523Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0263831Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0265152Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0266463Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0267802Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0269188Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0270473Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0270756Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.0270823Z Autotune Choices Stats: 2025-12-04T10:01:25.0272475Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0274795Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0276016Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0277242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0279875Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0282653Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0285406Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0288365Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0291462Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0294213Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0296966Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0299704Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0302693Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0305447Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0307281Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.0307809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0308111Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0308321Z unimplemented [] 2025-12-04T10:01:25.0308538Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0308938Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0310683Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0312222Z graph_break [] 2025-12-04T10:01:25.0312465Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0312765Z Autotune Choices Stats: 2025-12-04T10:01:25.0314493Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0316481Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0317104Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0317797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0319553Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0322265Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0324925Z triton_flex_attention_40 0.0102 ms 80.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0327633Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0330366Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0337902Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.0339676Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:25.0340140Z Autotune Choices Stats: 2025-12-04T10:01:25.0341932Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0344192Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0345168Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0346359Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0348535Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0351285Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0354151Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0357221Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0359971Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0362693Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0365441Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0368271Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.0371010Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0373797Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0375555Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.0376126Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0376436Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0376640Z unimplemented [] 2025-12-04T10:01:25.0376854Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0377267Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0378966Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0380501Z graph_break [] 2025-12-04T10:01:25.0380741Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0381045Z Autotune Choices Stats: 2025-12-04T10:01:25.0382791Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0384762Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0385385Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0386078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0387923Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0390599Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0393252Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0395996Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0398643Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0401304Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0402940Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.0403380Z Autotune Choices Stats: 2025-12-04T10:01:25.0405146Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0407399Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0408422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0409526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0411608Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0414414Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0417218Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.0419972Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0422703Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0425443Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0428261Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0430991Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0433708Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0436551Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0438240Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.0438757Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0439060Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0439269Z unimplemented [] 2025-12-04T10:01:25.0439485Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0439885Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0441584Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0443122Z graph_break [] 2025-12-04T10:01:25.0443367Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0443671Z Autotune Choices Stats: 2025-12-04T10:01:25.0445387Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0447342Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0447956Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], 
[16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0448688Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0450439Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0453084Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0456123Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0458807Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0461457Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0464096Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0465751Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.0466199Z Autotune Choices Stats: 2025-12-04T10:01:25.0468075Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0470340Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0471310Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0472405Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0474480Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0477383Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0480129Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0482874Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0485604Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0488347Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0491129Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0493871Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0496746Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0499479Z 
triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0501175Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.0501689Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0501994Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0502199Z unimplemented [] 2025-12-04T10:01:25.0502418Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0502824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0504524Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0506047Z graph_break [] 2025-12-04T10:01:25.0506283Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0506587Z Autotune Choices Stats: 2025-12-04T10:01:25.0508420Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0510403Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0511012Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0511700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0513445Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0516165Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.0518873Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0521524Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0524186Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0526834Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0528481Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.0528923Z Autotune Choices Stats: 2025-12-04T10:01:25.0530728Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0532989Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0533958Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0535087Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0537256Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0540020Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0542779Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0545528Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0548344Z triton_flex_attention_backward_106 0.0123 
ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0551124Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0553877Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0556926Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0559816Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0562557Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0564267Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.0564784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0565090Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0565299Z unimplemented [] 2025-12-04T10:01:25.0565514Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0565926Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0567632Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), 
('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0569165Z graph_break [] 2025-12-04T10:01:25.0569404Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0569701Z Autotune Choices Stats: 2025-12-04T10:01:25.0571495Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0573479Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0574103Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0574795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0576588Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0579314Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0581977Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0584641Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0587364Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0590027Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0591721Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.0592167Z Autotune Choices Stats: 2025-12-04T10:01:25.0593948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.0596209Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0597252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0598342Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0600448Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0603224Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0605980Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0608733Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0611521Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0614270Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0617020Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0619869Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0622605Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0625347Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0627058Z SingleProcess AUTOTUNE benchmarking takes 
0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.0627620Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0627930Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0628137Z unimplemented [] 2025-12-04T10:01:25.0628353Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0628750Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0630451Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0631989Z graph_break [] 2025-12-04T10:01:25.0632274Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0632586Z Autotune Choices Stats: 2025-12-04T10:01:25.0634317Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.0636301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0636960Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0637714Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0639749Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0642419Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0645081Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0647757Z triton_flex_attention_136 0.0113 ms 
81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0650418Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0653105Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0654767Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.0655395Z Autotune Choices Stats: 2025-12-04T10:01:25.0657278Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0659656Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0660676Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0661778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0663875Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0666633Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0669442Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0672243Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0674989Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0677746Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0680595Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0683334Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0686068Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0688822Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0690526Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.0691052Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0691359Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0691563Z unimplemented [] 2025-12-04T10:01:25.0691782Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0692186Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0693935Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0695474Z graph_break [] 2025-12-04T10:01:25.0695711Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0696028Z Autotune Choices Stats: 2025-12-04T10:01:25.0697764Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.0699800Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0700425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0701149Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0702910Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0705577Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0708290Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0710953Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0713664Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0716328Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0717976Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.0718472Z Autotune Choices Stats: 2025-12-04T10:01:25.0720235Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0722558Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0723535Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0724629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0726713Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0729469Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0732207Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0734975Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0737743Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0740518Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0743326Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0746067Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0748874Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0751622Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0753316Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.0753820Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0754132Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0754353Z unimplemented [] 2025-12-04T10:01:25.0754609Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0755013Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0756987Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0758521Z graph_break [] 2025-12-04T10:01:25.0758755Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0759134Z Autotune Choices Stats: 2025-12-04T10:01:25.0760859Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0762915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0763534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0764224Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0765979Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0768649Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0771306Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0774016Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0776695Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0779356Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0781086Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.0781513Z Autotune Choices Stats: 2025-12-04T10:01:25.0783316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0785564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0786538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0787704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0789792Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0792553Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0795351Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0798116Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0800903Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0803719Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0806468Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0809206Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0811950Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0814761Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0816464Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:25.0816976Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0817273Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0817481Z unimplemented [] 2025-12-04T10:01:25.0817698Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0818099Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0819795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0821400Z graph_break [] 2025-12-04T10:01:25.0821632Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0821931Z Autotune Choices Stats: 2025-12-04T10:01:25.0823692Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 
2025-12-04T10:01:25.0825659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0826273Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0826959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0828739Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0831394Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0834047Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0836751Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0839424Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0842110Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0843829Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.0844265Z Autotune Choices Stats: 2025-12-04T10:01:25.0846030Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0848287Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0849260Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0850355Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0852436Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0855435Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0858291Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0861043Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0863966Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.0866712Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0869540Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0872288Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0875032Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0877830Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0879531Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.0880047Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0880353Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0880559Z unimplemented [] 2025-12-04T10:01:25.0880776Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0881238Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0882939Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0884509Z graph_break [] 2025-12-04T10:01:25.0884775Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.0885077Z Autotune Choices Stats: 2025-12-04T10:01:25.0886813Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.0888785Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0889407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0890090Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0891838Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0894503Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0897204Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.0899874Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0902550Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0905276Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0906933Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.0907427Z Autotune Choices Stats: 2025-12-04T10:01:25.0909204Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.0911446Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0912415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0913511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.0915584Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0918384Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0921155Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0923970Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0926776Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0929524Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0932271Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0935009Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0937792Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.0940544Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0942235Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.0942784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.0943149Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.0943361Z unimplemented [] 2025-12-04T10:01:25.0943577Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.0943975Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.0945695Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.0947267Z graph_break [] 2025-12-04T10:01:25.0947512Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.0947814Z Autotune Choices Stats: 2025-12-04T10:01:25.0949548Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.0951518Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0952141Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0952835Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0954604Z triton_flex_attention_232 0.0092 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0957514Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0960183Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0962847Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.0965655Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.0968339Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0969997Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.0970429Z Autotune Choices Stats: 2025-12-04T10:01:25.0972204Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.0974452Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.0975430Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.0976526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.0978654Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0981424Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.0984225Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.0987040Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0989854Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.0992605Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.0995343Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.0998147Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1000904Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1003655Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1005414Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.1005937Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1006238Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1006442Z unimplemented [] 2025-12-04T10:01:25.1006657Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1007095Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1008787Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1010327Z graph_break [] 2025-12-04T10:01:25.1010562Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1010864Z Autotune Choices Stats: 2025-12-04T10:01:25.1012595Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1014573Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1015196Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1015885Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1017680Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1020356Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1023040Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1025787Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1028522Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1031185Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1032842Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.1033286Z Autotune Choices 
Stats: 2025-12-04T10:01:25.1035055Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1037309Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1038276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1039418Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1041504Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:25.1044274Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1047142Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1049887Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1052644Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1055652Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1058412Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1061217Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1063969Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1066805Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1068627Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.1069141Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1069442Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1069651Z unimplemented [] 2025-12-04T10:01:25.1069859Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1070261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1071948Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1073479Z graph_break [] 2025-12-04T10:01:25.1073706Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1074005Z Autotune Choices Stats: 2025-12-04T10:01:25.1075734Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1077712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1078333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1079017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1080805Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1083465Z triton_flex_attention_271 0.0091 ms 89.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1086152Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1088884Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1091541Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.1094199Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1095861Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.1096305Z Autotune Choices Stats: 2025-12-04T10:01:25.1098077Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.1100379Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1101355Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1102471Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1104549Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1107515Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1110267Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1113016Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1115760Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1118528Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1121307Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.1124065Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1126849Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1129709Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1131418Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.1131933Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1132239Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1132447Z 
unimplemented [] 2025-12-04T10:01:25.1132664Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1133060Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1134757Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1136296Z graph_break [] 2025-12-04T10:01:25.1136533Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1136830Z Autotune Choices Stats: 2025-12-04T10:01:25.1138558Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1140530Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1141189Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1141889Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1143644Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1146315Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1149109Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1151761Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1154438Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1157325Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1157615Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.1157691Z Autotune Choices Stats: 2025-12-04T10:01:25.1159424Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1159958Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1160323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1160978Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1162379Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1163860Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.1165194Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1166526Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1167854Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1169224Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1170560Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1171890Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1173312Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1174647Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1174942Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.1175080Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1175162Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1175227Z unimplemented [] 2025-12-04T10:01:25.1175337Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1175550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1176955Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1177025Z graph_break [] 2025-12-04T10:01:25.1177159Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1177231Z Autotune Choices Stats: 2025-12-04T10:01:25.1178885Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1179179Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1179431Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1179792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1181089Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1182480Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1183776Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1185080Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1186358Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1187709Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1187996Z 
SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.1188072Z Autotune Choices Stats: 2025-12-04T10:01:25.1189772Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1190303Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1190669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1191384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1192762Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1194113Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1195447Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1196790Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1198151Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1199493Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1200828Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1202273Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.1203606Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1204926Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1205210Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.1205351Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1205427Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1205492Z unimplemented [] 2025-12-04T10:01:25.1205601Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1205809Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1207208Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1207277Z graph_break [] 2025-12-04T10:01:25.1207414Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1207484Z Autotune Choices Stats: 2025-12-04T10:01:25.1209137Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1209427Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1209673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1210066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1211400Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1212724Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1214023Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1215310Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1216614Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1217937Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1218224Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.1218299Z Autotune Choices Stats: 2025-12-04T10:01:25.1219952Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1220511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1220907Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1221583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1222931Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1224271Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1225600Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1226944Z 
triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1228379Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1229724Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1231092Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1232484Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1233811Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1235139Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1235428Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.1235563Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.1235637Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1235701Z unimplemented [] 2025-12-04T10:01:25.1235807Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1236024Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1237460Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1237534Z graph_break [] 2025-12-04T10:01:25.1237666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1237733Z Autotune Choices Stats: 2025-12-04T10:01:25.1239345Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1239686Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1239963Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1240321Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1241666Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1242953Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1244238Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1245532Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1246817Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1248138Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1248421Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.1248497Z Autotune Choices Stats: 2025-12-04T10:01:25.1250150Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.1250747Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1251142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1251797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1253150Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1254504Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1256159Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1257587Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1258930Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1260269Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1261752Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1263094Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1264425Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1265753Z 
triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1266044Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.1266185Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1266270Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1266339Z unimplemented [] 2025-12-04T10:01:25.1266451Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1266664Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1268155Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1268232Z graph_break [] 2025-12-04T10:01:25.1268370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1268441Z Autotune Choices Stats: 2025-12-04T10:01:25.1270048Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1270405Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1270687Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1271046Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1272357Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1273635Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.1274934Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1276219Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1277562Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1278860Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1279182Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.1279291Z Autotune Choices Stats: 2025-12-04T10:01:25.1280971Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1281496Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1281861Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1282517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1283871Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1285212Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1286579Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1287915Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1289247Z 
triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1290688Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1292028Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1293358Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1294688Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1296010Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1296302Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.1296477Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1296558Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1296627Z unimplemented [] 2025-12-04T10:01:25.1296735Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1296947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1298344Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1298445Z graph_break [] 2025-12-04T10:01:25.1298585Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1298730Z Autotune Choices Stats: 2025-12-04T10:01:25.1300375Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1300664Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1300912Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1301272Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1302573Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1303854Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1305154Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1306486Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1307824Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1309113Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1309463Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.1309531Z Autotune Choices Stats: 2025-12-04T10:01:25.1311235Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.1311765Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1312132Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1312787Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1314135Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1315482Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1317077Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1318575Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1319931Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1321338Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1322680Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1324011Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1325352Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1326713Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1327007Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.1327145Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1327221Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1327285Z unimplemented [] 2025-12-04T10:01:25.1327404Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1327618Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1329017Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1329152Z graph_break [] 2025-12-04T10:01:25.1329285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1329354Z Autotune Choices Stats: 2025-12-04T10:01:25.1330998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1331289Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1331538Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1331897Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1333204Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1334488Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1335848Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1337389Z 
triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1338704Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1340099Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1340384Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.1340455Z Autotune Choices Stats: 2025-12-04T10:01:25.1342119Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.1342641Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1343008Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1343655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1345007Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1346395Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1347778Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1349158Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1350574Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1351910Z triton_flex_attention_backward_409 0.0133 ms 
91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1353240Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1354570Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1356194Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1357554Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1357848Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.1357983Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1358109Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1358174Z unimplemented [] 2025-12-04T10:01:25.1358280Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1358488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1359964Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1360036Z graph_break [] 2025-12-04T10:01:25.1360173Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1360241Z Autotune Choices Stats: 2025-12-04T10:01:25.1361854Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1362142Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1362386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1362742Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1364042Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1365326Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1366658Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1367947Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1369274Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1370624Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1370907Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.1370974Z Autotune Choices Stats: 2025-12-04T10:01:25.1372631Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.1373149Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1373518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1374166Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1375548Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1376893Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1378231Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1379661Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1380989Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1382339Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1383671Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1385002Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1386363Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1387784Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1388131Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.1388302Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1388372Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1388443Z unimplemented [] 2025-12-04T10:01:25.1388550Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1388762Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1390202Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1390275Z graph_break [] 2025-12-04T10:01:25.1390411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1390479Z Autotune Choices Stats: 2025-12-04T10:01:25.1392098Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1392386Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1392635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1392994Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1394293Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1395607Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1396902Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1398226Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1399581Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1400872Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1401162Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.1401233Z Autotune Choices Stats: 2025-12-04T10:01:25.1402897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1403417Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1403785Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1404426Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1405808Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1407138Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1408578Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1409919Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1411253Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1412584Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1413913Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1415281Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1416615Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1417937Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1418290Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.1418427Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1418777Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1418851Z unimplemented [] 2025-12-04T10:01:25.1418957Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1419167Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1420561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1420631Z graph_break [] 2025-12-04T10:01:25.1420766Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1420835Z Autotune Choices Stats: 2025-12-04T10:01:25.1422445Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 
2025-12-04T10:01:25.1422731Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1422977Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1423336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1424690Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1425982Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1427334Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1428717Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1430007Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1431298Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1431582Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.1431652Z Autotune Choices Stats: 2025-12-04T10:01:25.1433310Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1433827Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1434236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1434880Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1436235Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1437603Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1439001Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1440332Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1441656Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.1442987Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1444347Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1445680Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1447018Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1448454Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1448739Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.1448876Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1448957Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1449029Z unimplemented [] 2025-12-04T10:01:25.1449138Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1449349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1450734Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1450797Z graph_break [] 2025-12-04T10:01:25.1450940Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.1451009Z Autotune Choices Stats: 2025-12-04T10:01:25.1452617Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.1452902Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1453144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1453542Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1454853Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1456450Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1457980Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1459267Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1460567Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1461864Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1462153Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.1462221Z Autotune Choices Stats: 2025-12-04T10:01:25.1463876Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.1464440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1464821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1465470Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.1466832Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1468334Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1469681Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1471015Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1472344Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1473684Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1475040Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1476395Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1477792Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1479141Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1479439Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.1479578Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1479648Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1479718Z unimplemented [] 2025-12-04T10:01:25.1479827Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1480036Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1481429Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1481496Z graph_break [] 2025-12-04T10:01:25.1481640Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1481709Z Autotune Choices Stats: 2025-12-04T10:01:25.1483324Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1483650Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1483900Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1484257Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1485564Z triton_flex_attention_498 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1486871Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1488233Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1489512Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.1490796Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1492082Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1492369Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.1492436Z Autotune Choices Stats: 2025-12-04T10:01:25.1494133Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1494652Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1495021Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1495703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1497132Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1498476Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1499830Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1501167Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1502497Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1503866Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.1505194Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1506571Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1508022Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1509349Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1509640Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.1509775Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1509845Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1509916Z unimplemented [] 2025-12-04T10:01:25.1510023Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1510226Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1511632Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1511694Z graph_break [] 2025-12-04T10:01:25.1511835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1511904Z Autotune Choices Stats: 2025-12-04T10:01:25.1513563Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1513853Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1514100Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1514459Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1515787Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1517139Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1518437Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1519726Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1521015Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1522300Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1522633Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.1522706Z Autotune Choices Stats: 
2025-12-04T10:01:25.1524367Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1524883Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1525286Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1525959Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1527336Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.1528674Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1530011Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1531348Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1532708Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1534046Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1535371Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1536819Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1538153Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1539478Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1539764Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.1539898Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1539970Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1540041Z unimplemented [] 2025-12-04T10:01:25.1540146Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1540348Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1541746Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1546721Z graph_break [] 2025-12-04T10:01:25.1547014Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1547099Z Autotune Choices Stats: 2025-12-04T10:01:25.1548821Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1549119Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1549415Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1549807Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1551152Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1552443Z triton_flex_attention_537 0.0113 ms 90.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1553744Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1555015Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1556643Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.1558018Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1558316Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.1558388Z Autotune Choices Stats: 2025-12-04T10:01:25.1560057Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.1560689Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1561108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1561756Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1563115Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1564448Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1565781Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1567134Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1568463Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1569787Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1571207Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.1572539Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1573862Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1575191Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1575482Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.1575628Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1575701Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1575772Z 
unimplemented [] 2025-12-04T10:01:25.1575883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1576093Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1577561Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1577629Z graph_break [] 2025-12-04T10:01:25.1577773Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1577842Z Autotune Choices Stats: 2025-12-04T10:01:25.1579454Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1579806Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1580049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1580445Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1581758Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1583073Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1584380Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1585668Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1587001Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1588380Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1588672Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.1588782Z Autotune Choices Stats: 2025-12-04T10:01:25.1590437Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1591029Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1591405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1592058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1593411Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1594746Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.1596091Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1597480Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1598831Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1600194Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1601586Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1602922Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1604251Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1605589Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1605877Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.1606028Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1606105Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1606176Z unimplemented [] 2025-12-04T10:01:25.1606322Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1606534Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1607942Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1608005Z graph_break [] 2025-12-04T10:01:25.1608149Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1608255Z Autotune Choices Stats: 2025-12-04T10:01:25.1609875Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1610248Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1610490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1610871Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1612159Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1613456Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1614751Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1616114Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1617413Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1618700Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1619051Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.1619122Z Autotune Choices Stats: 2025-12-04T10:01:25.1620800Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.1621324Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1621695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1622343Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1623689Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1625023Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1626405Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1627785Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1629121Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1630564Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1631898Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1633237Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.1634569Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1635906Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1636231Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.1636377Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1636452Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1636522Z unimplemented [] 2025-12-04T10:01:25.1636629Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1636834Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1638240Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1638370Z graph_break [] 2025-12-04T10:01:25.1638510Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1638578Z Autotune Choices Stats: 2025-12-04T10:01:25.1640212Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1640507Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1640750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1641110Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1642407Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1643693Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1644986Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1646286Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1647583Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1648904Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1649238Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.1649336Z Autotune Choices Stats: 2025-12-04T10:01:25.1650995Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1651513Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1651881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1652527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1653869Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1655494Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1656896Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1658236Z 
triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1659713Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1661062Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1662393Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1663732Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1665059Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1666434Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1666716Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.1666865Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.1666935Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1667007Z unimplemented [] 2025-12-04T10:01:25.1667112Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1667387Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1668825Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1668921Z graph_break [] 2025-12-04T10:01:25.1669062Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1669166Z Autotune Choices Stats: 2025-12-04T10:01:25.1670779Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1671075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1671314Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1671673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1672963Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1674248Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1675567Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1676866Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1678161Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1679537Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1679828Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:25.1679898Z Autotune Choices Stats: 2025-12-04T10:01:25.1681563Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1682078Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1682447Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1683095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1684440Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1685816Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1687162Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1688575Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1689945Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1691276Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1692606Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1693939Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1695301Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1696660Z 
triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1696943Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.1697125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1697197Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1697301Z unimplemented [] 2025-12-04T10:01:25.1697411Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1697616Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1699043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1699111Z graph_break [] 2025-12-04T10:01:25.1699254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1699324Z Autotune Choices Stats: 2025-12-04T10:01:25.1700931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1701218Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1701454Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1701823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1703123Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1704455Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.1705752Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1707045Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1708489Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1709776Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1710063Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.1710129Z Autotune Choices Stats: 2025-12-04T10:01:25.1711785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1712305Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1712669Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1713319Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1714708Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1716061Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1717434Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1718829Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1720169Z 
triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1721505Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1722835Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1724224Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1725554Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1726889Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1727229Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.1727369Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1727438Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1727501Z unimplemented [] 2025-12-04T10:01:25.1727608Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1727844Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1729246Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1729316Z graph_break [] 2025-12-04T10:01:25.1729457Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1729527Z Autotune Choices Stats: 2025-12-04T10:01:25.1731124Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1731414Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1731654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1732015Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1733356Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1734651Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1735943Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1737308Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1738626Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1739910Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1740191Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.1740258Z Autotune Choices Stats: 2025-12-04T10:01:25.1741912Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1742434Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1742806Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1743493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1744842Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1746181Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1747665Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1749014Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1750351Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1751693Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1753028Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1754396Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1756014Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1757438Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1757836Z SingleProcess AUTOTUNE 
benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.1757981Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1758052Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1758117Z unimplemented [] 2025-12-04T10:01:25.1758229Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1758434Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1759841Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1759907Z graph_break [] 2025-12-04T10:01:25.1760050Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1760118Z Autotune Choices Stats: 2025-12-04T10:01:25.1761729Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1762022Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1762267Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1762636Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1763998Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1765292Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1766630Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1768022Z 
triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1769318Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1770616Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1770905Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.1770975Z Autotune Choices Stats: 2025-12-04T10:01:25.1772623Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1773180Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1773547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1774194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1775537Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1776979Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1778319Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1779648Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1780978Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1782305Z triton_flex_attention_backward_675 0.0164 ms 
81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1783665Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1784995Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1786363Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1787823Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1788106Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.1788251Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1788321Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1788386Z unimplemented [] 2025-12-04T10:01:25.1788497Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1788706Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1790109Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1790172Z graph_break [] 2025-12-04T10:01:25.1790305Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1790381Z Autotune Choices Stats: 2025-12-04T10:01:25.1791980Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.1792266Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1792548Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1792917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1794219Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1795516Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1796936Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1798222Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1799526Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1800803Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1801093Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.1801163Z Autotune Choices Stats: 2025-12-04T10:01:25.1802854Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1803382Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1803747Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1804395Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1805772Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1807177Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1808513Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1809844Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1811185Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1812543Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1813875Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1815211Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1816623Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1817958Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1818242Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.1818386Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1818458Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1818525Z unimplemented [] 2025-12-04T10:01:25.1818637Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1818841Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1820238Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1820302Z graph_break [] 2025-12-04T10:01:25.1820437Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1820510Z Autotune Choices Stats: 2025-12-04T10:01:25.1822152Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1822447Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1822685Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1823048Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1824351Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1825735Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1827033Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1828379Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1829664Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1830954Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1831238Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.1831308Z Autotune Choices Stats: 2025-12-04T10:01:25.1832992Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.1833528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1833888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1834625Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1836019Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1837376Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1838722Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1840050Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1841390Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1842760Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1844099Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1845536Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1846877Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1848227Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1848508Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.1848650Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1848724Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1848787Z unimplemented [] 2025-12-04T10:01:25.1848915Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1849130Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1850535Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1850597Z graph_break [] 2025-12-04T10:01:25.1850731Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1850807Z Autotune Choices Stats: 2025-12-04T10:01:25.1852448Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, 
"best_triton_pos": 0} 2025-12-04T10:01:25.1852742Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1852984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1853384Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1854713Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1856246Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1857544Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1858848Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1860136Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1861491Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1861785Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.1861856Z Autotune Choices Stats: 2025-12-04T10:01:25.1863509Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1864081Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1864494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1865177Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1866531Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1867923Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1869263Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1870592Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1871979Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1873311Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1874671Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1876087Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1877445Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1878780Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1879061Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.1879204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1879275Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1879337Z unimplemented [] 2025-12-04T10:01:25.1879451Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1879657Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1881088Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1881156Z graph_break [] 2025-12-04T10:01:25.1881289Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.1881365Z Autotune Choices Stats: 2025-12-04T10:01:25.1882963Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1883289Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1883574Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1883937Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1885262Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1886562Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1887852Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1889143Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1890424Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1891756Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1892045Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.1892122Z Autotune Choices Stats: 2025-12-04T10:01:25.1893773Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.1894364Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1894758Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1897648Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.1899059Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1900446Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1901786Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1903118Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1904439Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1905760Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1907324Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1908747Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1910095Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1911425Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1911727Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.1911875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1911954Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1912020Z unimplemented [] 2025-12-04T10:01:25.1912131Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1912349Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1913746Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1913817Z graph_break [] 2025-12-04T10:01:25.1913959Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1914029Z Autotune Choices Stats: 2025-12-04T10:01:25.1915641Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.1916009Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1916296Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1916657Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1918011Z triton_flex_attention_765 0.0103 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1919297Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1920582Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1921855Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.1923141Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1924424Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1924746Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.1924818Z Autotune Choices Stats: 2025-12-04T10:01:25.1926585Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1927112Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1927518Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1928161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1929505Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1930846Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1932169Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1933491Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1934805Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1936485Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.1937854Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1939181Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1940503Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1941829Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1942116Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.1942256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1942340Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1942404Z unimplemented [] 2025-12-04T10:01:25.1942511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1942722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1944116Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1944225Z graph_break [] 2025-12-04T10:01:25.1944359Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1944426Z Autotune Choices Stats: 2025-12-04T10:01:25.1946114Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.1946401Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1946659Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1947055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1948404Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1949683Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1950974Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1952249Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1953538Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.1954817Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1955167Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.1955485Z Autotune Choices Stats: 
2025-12-04T10:01:25.1957284Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1957883Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1958250Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1958898Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1960246Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.1961567Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1962886Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1964224Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1965591Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1967001Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1968370Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1969698Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1971029Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1972351Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.1972640Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.1972778Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.1972854Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.1972919Z unimplemented [] 2025-12-04T10:01:25.1973027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.1973243Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.1974638Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.1974776Z graph_break [] 2025-12-04T10:01:25.1974909Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.1974980Z Autotune Choices Stats: 2025-12-04T10:01:25.1976613Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.1976937Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1977185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1977545Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1978843Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1980138Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1981442Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.1982723Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1984013Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 
2025-12-04T10:01:25.1985419Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1985703Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.1985772Z Autotune Choices Stats: 2025-12-04T10:01:25.1987561Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.1988092Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.1988459Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.1989108Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.1990460Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1991803Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1993135Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.1994499Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.1995880Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.1997242Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.1998573Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.1999899Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2001232Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2002557Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2002850Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.2002989Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2003066Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2003165Z unimplemented [] 
2025-12-04T10:01:25.2003272Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2003484Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2004945Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2005018Z graph_break [] 2025-12-04T10:01:25.2005157Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2005224Z Autotune Choices Stats: 2025-12-04T10:01:25.2006866Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.2007155Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2007407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2007765Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2009064Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2010339Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2011625Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2012909Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2014228Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2015575Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2015912Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.2015983Z Autotune Choices Stats: 2025-12-04T10:01:25.2017635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.2018152Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2018522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2019169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2020522Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2021849Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2023195Z 
triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2024623Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2025976Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2027393Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2028713Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2030046Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2031390Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2032707Z triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2033036Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.2033277Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.2033358Z Traceback (most recent call last): 2025-12-04T10:01:25.2033713Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.2033780Z self.assertTrue( 2025-12-04T10:01:25.2034021Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.2034138Z raise self.failureException(msg) 2025-12-04T10:01:25.2034418Z AssertionError: False is not true : Log file /tmp/tmpkgut0tc3/flex_attention_configs.json was not created 2025-12-04T10:01:25.2034423Z 2025-12-04T10:01:25.2034571Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:25.2034873Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:25.2034877Z 2025-12-04T10:01:25.2035102Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.2035244Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2035318Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2035389Z unimplemented [] 2025-12-04T10:01:25.2035500Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2036914Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.2037123Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2037189Z graph_break [] 2025-12-04T10:01:25.2037333Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2038497Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.2038596Z current_size = base.storage().size() 2025-12-04T10:01:25.2038667Z Autotune Choices Stats: 2025-12-04T10:01:25.2040286Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.2040578Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:25.2040821Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2041183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2042514Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2043861Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2045201Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2046488Z triton_flex_attention_0 0.0143 ms 65.4% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2047771Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2049045Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2049333Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.2049405Z Autotune Choices Stats: 2025-12-04T10:01:25.2051056Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.2051576Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2051983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2052666Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2054056Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2055711Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2057064Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2058392Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2059713Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2061042Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2062359Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2063830Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2065203Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.2066532Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2066819Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.2066966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2067039Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2067109Z unimplemented [] 2025-12-04T10:01:25.2067287Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2067504Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2068901Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2068965Z graph_break [] 2025-12-04T10:01:25.2069106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2069178Z Autotune Choices Stats: 2025-12-04T10:01:25.2070775Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2071079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2071369Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2071776Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2073104Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2074419Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2075708Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2076996Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2078276Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2079548Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2079838Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.2079906Z Autotune Choices Stats: 2025-12-04T10:01:25.2081562Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2082205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2082572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2083252Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2084625Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2085958Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2087291Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2088614Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.2089943Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2091261Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2092664Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2094042Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2095371Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2096701Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2096987Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.2097133Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2097205Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2097275Z unimplemented [] 2025-12-04T10:01:25.2097385Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2097593Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2098994Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 
25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2099059Z graph_break [] 2025-12-04T10:01:25.2099196Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2099265Z Autotune Choices Stats: 2025-12-04T10:01:25.2100872Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2101245Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2101485Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2101848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2103175Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2104493Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2105778Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2107069Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2108400Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2109687Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2109973Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:25.2110042Z Autotune Choices Stats: 2025-12-04T10:01:25.2111688Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2112318Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2112686Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2113366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2114716Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2116044Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2117373Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2118695Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2120023Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2121395Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2122778Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2124137Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2125455Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2126779Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2127061Z SingleProcess AUTOTUNE 
benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.2127202Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2127273Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2127346Z unimplemented [] 2025-12-04T10:01:25.2127453Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2127656Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2129056Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2129119Z graph_break [] 2025-12-04T10:01:25.2129264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2129332Z Autotune Choices Stats: 2025-12-04T10:01:25.2130933Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2131316Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2131587Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2131947Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2133269Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2134548Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2135834Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2137135Z triton_flex_attention_60 
0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2138427Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2139704Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2140028Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.2140132Z Autotune Choices Stats: 2025-12-04T10:01:25.2141815Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2142341Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2142744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2143393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2144726Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2146059Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2147433Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2148757Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2150082Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2151520Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2152881Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2154214Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2155782Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2157124Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2157407Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.2157551Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2157624Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2157696Z unimplemented [] 2025-12-04T10:01:25.2157801Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2158006Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2159406Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2159542Z graph_break [] 2025-12-04T10:01:25.2159682Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2159800Z Autotune Choices Stats: 2025-12-04T10:01:25.2161448Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": 
"triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2161740Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2162054Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2162424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2163725Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2165014Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2166297Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2167578Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2168858Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2170177Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2170493Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.2170563Z Autotune Choices Stats: 2025-12-04T10:01:25.2172280Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2172804Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2173171Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2173815Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2175165Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2176491Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2177820Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2179146Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2180568Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2181917Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2183245Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2184576Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2185892Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2187272Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2187575Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.2187717Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2187787Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2187858Z unimplemented [] 2025-12-04T10:01:25.2187965Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2188170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2189574Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2189709Z graph_break [] 2025-12-04T10:01:25.2189853Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2189922Z Autotune Choices Stats: 2025-12-04T10:01:25.2191591Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2191885Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2192125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2192488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2193784Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2195070Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2196350Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2197633Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2198908Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2200295Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2200589Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.2200659Z Autotune Choices Stats: 2025-12-04T10:01:25.2202344Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2202866Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:25.2203234Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2203881Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2205225Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2206551Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2207882Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2209250Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2210651Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2212002Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2213334Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2214655Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2215981Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2217318Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.2217600Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.2217744Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2217852Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2217914Z unimplemented [] 2025-12-04T10:01:25.2218028Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2218269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2219701Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2219765Z graph_break [] 2025-12-04T10:01:25.2219906Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2219977Z Autotune Choices Stats: 2025-12-04T10:01:25.2221615Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2221909Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2222153Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2222515Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2223812Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2225096Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2226377Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.2227733Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2229089Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2230395Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2230761Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.2230830Z Autotune Choices Stats: 2025-12-04T10:01:25.2232478Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2233006Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2233375Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2234021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2235354Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2236682Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2238018Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2239459Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2240823Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2242146Z 
triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2243474Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2244802Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2246128Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2247459Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2247808Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.2247954Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2248024Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2248088Z unimplemented [] 2025-12-04T10:01:25.2248200Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2248408Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2249833Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2249932Z graph_break [] 2025-12-04T10:01:25.2250069Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2250142Z Autotune Choices 
Stats: 2025-12-04T10:01:25.2251757Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.2252054Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2252300Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2252663Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2253953Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2255433Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2256799Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2258160Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2259534Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2260871Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2261167Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.2261246Z Autotune Choices Stats: 2025-12-04T10:01:25.2262907Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2263438Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2263805Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2264454Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2265803Z 
triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2267125Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2268622Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2269976Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2271307Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2272626Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2273958Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2275286Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2276607Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2277981Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2278296Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.2278475Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2278548Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2278613Z unimplemented [] 2025-12-04T10:01:25.2278726Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2278935Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:25.2280370Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2280434Z graph_break [] 2025-12-04T10:01:25.2280573Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2280647Z Autotune Choices Stats: 2025-12-04T10:01:25.2282248Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.2282539Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2282780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2283146Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2284436Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2285715Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2286992Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2288375Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2289701Z 
triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2290984Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2291270Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.2291348Z Autotune Choices Stats: 2025-12-04T10:01:25.2292990Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2293511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2293874Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2294527Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2295865Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2297234Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2298626Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2299986Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2301315Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2302644Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2303967Z triton_flex_attention_backward_164 0.0133 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2305295Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2306615Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2308091Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2308377Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.2308520Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2308591Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2308709Z unimplemented [] 2025-12-04T10:01:25.2308823Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2309029Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2310429Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2310493Z graph_break [] 2025-12-04T10:01:25.2310631Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2310705Z Autotune Choices Stats: 2025-12-04T10:01:25.2312325Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2312615Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2312853Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2313221Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2314525Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2315810Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2317194Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2318517Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2319804Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2321092Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2321381Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.2321458Z Autotune Choices Stats: 2025-12-04T10:01:25.2323102Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2323638Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2323998Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2324654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2325991Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2327442Z triton_flex_attention_backward_179 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2328806Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2330134Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2331463Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2332788Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2334118Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2335439Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2336879Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2338236Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2338529Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:25.2338675Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2338747Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2338811Z unimplemented [] 2025-12-04T10:01:25.2338925Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2339134Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2340541Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2340607Z graph_break [] 2025-12-04T10:01:25.2340741Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2340815Z Autotune Choices Stats: 2025-12-04T10:01:25.2342409Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.2342705Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2342945Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2343312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2344602Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2345952Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2347328Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2348652Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2349930Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2351218Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2351504Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.2351571Z Autotune Choices Stats: 2025-12-04T10:01:25.2353212Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2353737Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2354097Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2354786Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2356499Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2357899Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2359240Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2360562Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2361902Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2363220Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2364545Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2365929Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2367344Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2368708Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2368998Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.2369142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2369214Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2369278Z unimplemented [] 2025-12-04T10:01:25.2369392Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2369597Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2371004Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2371070Z graph_break [] 2025-12-04T10:01:25.2371206Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2371279Z Autotune Choices Stats: 2025-12-04T10:01:25.2372888Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2373185Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2373427Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2373795Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2375127Z 
triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2376469Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2377804Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2379091Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2380372Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2381666Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2381949Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.2382023Z Autotune Choices Stats: 2025-12-04T10:01:25.2383662Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.2384190Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2384650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2385304Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2386751Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2388143Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2389469Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2390810Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2392146Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2393464Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2394795Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2396242Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2397595Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2398926Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2399209Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.2399354Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2399425Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2399490Z unimplemented [] 2025-12-04T10:01:25.2399610Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2399819Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2401226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2401291Z graph_break [] 2025-12-04T10:01:25.2401426Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2401504Z Autotune Choices Stats: 2025-12-04T10:01:25.2403119Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.2403408Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2403691Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2404084Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2405418Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2406744Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2408039Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2409328Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2410606Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2411898Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2412182Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 
2025-12-04T10:01:25.2412251Z Autotune Choices Stats: 2025-12-04T10:01:25.2413897Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2414484Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2414876Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2415524Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2416892Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2418226Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2419566Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2420889Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2422218Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2423538Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2424992Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2426361Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2427744Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2429075Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2429356Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.2429496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2429566Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2429631Z unimplemented [] 2025-12-04T10:01:25.2429739Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2429943Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2431333Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2431397Z graph_break [] 2025-12-04T10:01:25.2431529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2431603Z Autotune Choices Stats: 2025-12-04T10:01:25.2433210Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2433570Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2433807Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2434200Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2435524Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2436825Z triton_flex_attention_252 0.0082 
ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2438101Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2439393Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2440667Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2441952Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2442230Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.2442335Z Autotune Choices Stats: 2025-12-04T10:01:25.2443976Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2444793Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2445159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2445841Z dtypes: torch.float16, 
torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2447196Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2448617Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2449961Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2451298Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2452630Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2453985Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2455842Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2457282Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2458616Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2459954Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2460239Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.2460387Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2460460Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:25.2460524Z unimplemented [] 2025-12-04T10:01:25.2460640Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2460847Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2462258Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2462332Z graph_break [] 2025-12-04T10:01:25.2462472Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2462603Z Autotune Choices Stats: 2025-12-04T10:01:25.2464221Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2464601Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2464846Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2465209Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2466553Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2467894Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2469187Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2470494Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2471780Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2473074Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2473430Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.2473499Z Autotune Choices Stats: 2025-12-04T10:01:25.2475178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2475746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2476112Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2476779Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2478125Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2479462Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2480802Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2482139Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2483507Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2484895Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2486278Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2487608Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2488935Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2490269Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2490553Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.2490693Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2490762Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2490826Z unimplemented [] 2025-12-04T10:01:25.2490935Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2491141Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2492537Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2492668Z graph_break [] 2025-12-04T10:01:25.2492804Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2492880Z Autotune Choices Stats: 2025-12-04T10:01:25.2494525Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2494816Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2495092Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2495451Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2496742Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2498035Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2499318Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2500603Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2501887Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2503211Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.2503559Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.2503646Z Autotune Choices Stats: 2025-12-04T10:01:25.2505319Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2505851Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2506217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2506869Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2508261Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2509597Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2510929Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2512248Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2513677Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2515039Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2516372Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2517692Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.2519025Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2520348Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2520627Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.2520769Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2520839Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2520903Z unimplemented [] 2025-12-04T10:01:25.2521011Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2521256Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2522659Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), 
('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2522756Z graph_break [] 2025-12-04T10:01:25.2522922Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2522997Z Autotune Choices Stats: 2025-12-04T10:01:25.2524622Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2524916Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2525158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2525517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2526815Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2528107Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2529388Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2530680Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2531992Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2533361Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2533638Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.2533711Z Autotune Choices Stats: 2025-12-04T10:01:25.2535423Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2535950Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2536313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2536971Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2538311Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2539648Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2540986Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:25.2542468Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2543831Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2545164Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2546491Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2547871Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2549201Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2550538Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2550815Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.2550996Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2551101Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2551165Z unimplemented [] 2025-12-04T10:01:25.2551274Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2551481Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2552900Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2552971Z graph_break [] 2025-12-04T10:01:25.2553106Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2553183Z Autotune Choices Stats: 2025-12-04T10:01:25.2554806Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2555097Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2555624Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2556003Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2557311Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2558605Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2559893Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2561189Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2562621Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2563954Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2564241Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.2564318Z Autotune Choices Stats: 2025-12-04T10:01:25.2565974Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2566504Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2566867Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2567531Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2568884Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2570228Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2571613Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2573015Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2574379Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2575706Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2577044Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2578376Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2579705Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2581028Z 
triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2581374Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.2581521Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2581593Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2581656Z unimplemented [] 2025-12-04T10:01:25.2581816Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2582030Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2583459Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2583530Z graph_break [] 2025-12-04T10:01:25.2583664Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2583737Z Autotune Choices Stats: 2025-12-04T10:01:25.2585348Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2585642Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2585880Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2586246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2587624Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2588906Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.2590193Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2591577Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2592889Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2594181Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2594462Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.2594535Z Autotune Choices Stats: 2025-12-04T10:01:25.2596178Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2596700Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2597058Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2597707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2599046Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2600383Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2601831Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2603221Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2604547Z 
triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2605881Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2607210Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2608538Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2609888Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2611292Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2611617Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.2611756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2611832Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2611897Z unimplemented [] 2025-12-04T10:01:25.2612010Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2612249Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2613648Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2613721Z graph_break [] 2025-12-04T10:01:25.2613852Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2613925Z Autotune Choices Stats: 2025-12-04T10:01:25.2615538Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2615831Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2616071Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2616440Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2617743Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2619034Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2620356Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2621706Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2623028Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2624318Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2624601Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.2624673Z Autotune Choices Stats: 2025-12-04T10:01:25.2626325Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2626854Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2627259Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2627916Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2629257Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2630694Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2632064Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2633404Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2634728Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2636046Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2637383Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2638710Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2640087Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2641490Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2641773Z SingleProcess AUTOTUNE 
benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.2641941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2642017Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2642083Z unimplemented [] 2025-12-04T10:01:25.2642193Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2642398Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2643794Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2643862Z graph_break [] 2025-12-04T10:01:25.2643999Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2644072Z Autotune Choices Stats: 2025-12-04T10:01:25.2645671Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2645959Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2646209Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2646563Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2647859Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2649182Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2650530Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2651855Z 
triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2653133Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2654425Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2654704Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.2654776Z Autotune Choices Stats: 2025-12-04T10:01:25.2656731Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2657268Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2657631Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2658281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2659702Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2661128Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2662509Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2663847Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2665175Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2666496Z triton_flex_attention_backward_391 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2667921Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2669245Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2670674Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2672042Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2672325Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.2672463Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2672542Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2672604Z unimplemented [] 2025-12-04T10:01:25.2672714Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2672918Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2674310Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2674380Z graph_break [] 2025-12-04T10:01:25.2674514Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2674587Z Autotune Choices Stats: 2025-12-04T10:01:25.2676191Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2676483Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2676723Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2677078Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2678383Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2679774Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2681087Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2682382Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2683661Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2684942Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2685229Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.2685305Z Autotune Choices Stats: 2025-12-04T10:01:25.2691824Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.2692403Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2692854Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2693539Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2694935Z triton_flex_attention_backward_407 0.0121 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2696324Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2697657Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2698981Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2700305Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2701631Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2702955Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2704421Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2705776Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2707104Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2707489Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.2707642Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2707724Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2707792Z unimplemented [] 2025-12-04T10:01:25.2707908Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2708137Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2709537Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2709607Z graph_break [] 2025-12-04T10:01:25.2709748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2709825Z Autotune Choices Stats: 2025-12-04T10:01:25.2711440Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2711747Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2711992Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2712394Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2713737Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2715067Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2716357Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2717645Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2718947Z triton_flex_attention_421 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2720233Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2720523Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.2720599Z Autotune Choices Stats: 2025-12-04T10:01:25.2722234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.2722822Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2723215Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2723895Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2725270Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2726610Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2727932Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2729263Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2730594Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2731922Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2733280Z triton_flex_attention_backward_430 0.0143 ms 85.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2734653Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2736010Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2737331Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2737622Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.2737764Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2737842Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2737907Z unimplemented [] 2025-12-04T10:01:25.2738016Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2738236Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2739631Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2739701Z graph_break [] 2025-12-04T10:01:25.2739835Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2739904Z Autotune Choices Stats: 2025-12-04T10:01:25.2741516Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2741847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2742127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2742484Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2743817Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2745135Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2746421Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2747762Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2749054Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2750340Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2750626Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.2750700Z Autotune Choices Stats: 2025-12-04T10:01:25.2752343Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2752941Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2753336Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2754030Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2755671Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2757023Z triton_flex_attention_backward_443 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2758346Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2759681Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2761022Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2762347Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2763844Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2765205Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2766546Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2767864Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2768154Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.2768297Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2768375Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2768451Z unimplemented [] 2025-12-04T10:01:25.2768562Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2768783Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2770179Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2770248Z graph_break [] 2025-12-04T10:01:25.2770387Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2770459Z Autotune Choices Stats: 2025-12-04T10:01:25.2772062Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2772425Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2772714Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2773071Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2774404Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2775676Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2776967Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2778253Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2779521Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2780800Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2781119Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.2781228Z Autotune Choices Stats: 2025-12-04T10:01:25.2782911Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2783434Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2783831Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2784480Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2785813Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2787154Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2788553Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2789882Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2791198Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2792636Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2793992Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2795335Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2796669Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2797994Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2798288Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.2798426Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2798506Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2798569Z unimplemented [] 2025-12-04T10:01:25.2798676Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2798888Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2800276Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2800385Z graph_break [] 2025-12-04T10:01:25.2800521Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2800623Z Autotune Choices Stats: 2025-12-04T10:01:25.2802266Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.2802552Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2802799Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2803184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2804482Z 
triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2805755Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2807035Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2808310Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2809592Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2810870Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2811213Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.2811287Z Autotune Choices Stats: 2025-12-04T10:01:25.2812981Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 
2025-12-04T10:01:25.2813511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2813871Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2814503Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2815848Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2817183Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2818507Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2819830Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2821214Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2822603Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2823936Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2825259Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2826582Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2827950Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2828239Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.2828375Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2828450Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2828515Z unimplemented [] 2025-12-04T10:01:25.2828626Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2828838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2830226Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2830381Z graph_break [] 2025-12-04T10:01:25.2830516Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2830583Z Autotune Choices Stats: 2025-12-04T10:01:25.2832237Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.2832525Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2832774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2833127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2834425Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2835703Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2836980Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2838252Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2839529Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2840902Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2841178Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.2841246Z Autotune Choices Stats: 
2025-12-04T10:01:25.2842931Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.2843464Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2843841Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2844481Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2845828Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.2847174Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2848501Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2849865Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2851249Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2852650Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2853974Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2855542Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2856905Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2858244Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2858536Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.2858674Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2858824Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2858890Z unimplemented [] 2025-12-04T10:01:25.2858998Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2859258Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2860687Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2860756Z graph_break [] 2025-12-04T10:01:25.2860890Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2860957Z Autotune Choices Stats: 2025-12-04T10:01:25.2862622Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.2862919Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2863170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2863522Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2864820Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2866097Z triton_flex_attention_518 0.0113 ms 90.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2867460Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2868734Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2870103Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.2871415Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2871729Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.2871800Z Autotune Choices Stats: 2025-12-04T10:01:25.2873443Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.2873972Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2874335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2874976Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2876312Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2877657Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2878980Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2880398Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2881744Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2883075Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2884397Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.2885708Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2887031Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2888350Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2888676Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.2888845Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2888923Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2888987Z 
unimplemented [] 2025-12-04T10:01:25.2889092Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2889311Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2890726Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2890798Z graph_break [] 2025-12-04T10:01:25.2890963Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2891034Z Autotune Choices Stats: 2025-12-04T10:01:25.2892625Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.2892910Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2893158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2893510Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2894795Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2896061Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2897340Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2898637Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2899978Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2901303Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2901586Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.2901656Z Autotune Choices Stats: 2025-12-04T10:01:25.2903294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.2903816Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2904188Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2904819Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2906158Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2907534Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.2908962Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2910313Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2911633Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2912958Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2914284Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2915622Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2916958Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2918306Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2918634Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.2918804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2918884Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2918948Z unimplemented [] 2025-12-04T10:01:25.2919054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2919265Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2920688Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2920761Z graph_break [] 2025-12-04T10:01:25.2920895Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2920967Z Autotune Choices Stats: 2025-12-04T10:01:25.2922580Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2922871Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2923119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2923474Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2924775Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2926053Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2927340Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2928711Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2930027Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2931323Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2931601Z 
SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.2931670Z Autotune Choices Stats: 2025-12-04T10:01:25.2933311Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.2933831Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2934200Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2934840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2936179Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2937547Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2938954Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2940322Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2941643Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2942981Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2944297Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2945618Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.2946949Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2948398Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2948684Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.2948821Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.2948890Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2948961Z unimplemented [] 2025-12-04T10:01:25.2949100Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2949308Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2950698Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2950765Z graph_break [] 2025-12-04T10:01:25.2950901Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2950969Z Autotune Choices Stats: 2025-12-04T10:01:25.2952560Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2952855Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2953102Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2953457Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2954747Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2956158Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2957893Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2959229Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2960515Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2961793Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2962076Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.2962147Z Autotune Choices Stats: 2025-12-04T10:01:25.2963808Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.2964340Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2964711Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2965345Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2966689Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2968110Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2969466Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2970793Z 
triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2972119Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2973442Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.2974766Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2976100Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.2977540Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2978932Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.2979229Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.2979371Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.2979444Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.2979517Z unimplemented [] 2025-12-04T10:01:25.2979625Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.2979841Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.2981225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.2981296Z graph_break [] 2025-12-04T10:01:25.2981432Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.2981503Z Autotune Choices Stats: 2025-12-04T10:01:25.2983105Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.2983397Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2983647Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2983999Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2985293Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2986628Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2988039Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.2989359Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2990644Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.2991925Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2992207Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.2992275Z Autotune Choices Stats: 2025-12-04T10:01:25.2993925Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.2994447Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.2994818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.2995498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.2996906Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.2998269Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.2999602Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3000925Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3002248Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3003568Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3004887Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3006252Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3007680Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3009033Z 
triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3009324Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.3009460Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3009529Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3009599Z unimplemented [] 2025-12-04T10:01:25.3009706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3009914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3011303Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3011366Z graph_break [] 2025-12-04T10:01:25.3011506Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3011574Z Autotune Choices Stats: 2025-12-04T10:01:25.3013170Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3013459Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3013704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3014058Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3015414Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3016757Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.3018060Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3019337Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3020618Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3021895Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3022180Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:25.3022251Z Autotune Choices Stats: 2025-12-04T10:01:25.3023899Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3024414Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3024834Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3025512Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3026897Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3028315Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3029658Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3030984Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3032301Z 
triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3033626Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3034949Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3036367Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3037715Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3039033Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3039321Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.3039458Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3039530Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3039601Z unimplemented [] 2025-12-04T10:01:25.3039708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3039914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3041301Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3041364Z graph_break [] 2025-12-04T10:01:25.3041517Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3041587Z Autotune Choices Stats: 2025-12-04T10:01:25.3043183Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3043466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3043756Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3044150Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3045477Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3046804Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3048081Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3049357Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3050641Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3051913Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3052203Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.3052271Z Autotune Choices Stats: 2025-12-04T10:01:25.3053908Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.3054511Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3054923Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3055816Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3057230Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3058563Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3059896Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3061224Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3062546Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3063874Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3065326Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3066685Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3068065Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3069382Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3069668Z SingleProcess AUTOTUNE 
benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.3069804Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3069874Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3069947Z unimplemented [] 2025-12-04T10:01:25.3070053Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3070259Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3071650Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3071714Z graph_break [] 2025-12-04T10:01:25.3071852Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3071920Z Autotune Choices Stats: 2025-12-04T10:01:25.3073519Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3073876Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3074127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3074517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3075833Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3077120Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3078392Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3079669Z 
triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3080947Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3082226Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3082510Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.3082627Z Autotune Choices Stats: 2025-12-04T10:01:25.3084270Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3084876Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3085240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3085906Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3087247Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3088570Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3089892Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3091232Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3092565Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3093929Z triton_flex_attention_backward_657 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3095326Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3096676Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3098002Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3099324Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3099612Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.3099750Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3099830Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3099894Z unimplemented [] 2025-12-04T10:01:25.3100001Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3100216Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3101607Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3101675Z graph_break [] 2025-12-04T10:01:25.3101808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3101917Z Autotune Choices Stats: 2025-12-04T10:01:25.3103518Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3103873Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3104123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3104476Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3105803Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3107084Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3108456Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3109736Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3111018Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3112300Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3112648Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.3112717Z Autotune Choices Stats: 2025-12-04T10:01:25.3114388Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3114944Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3115319Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3115957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3117306Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3118652Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3119970Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3121307Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3122671Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3124089Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3125442Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3126771Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3128092Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3129403Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3129692Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.3129828Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3129902Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3129968Z unimplemented [] 2025-12-04T10:01:25.3130072Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3130277Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3131670Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3131806Z graph_break [] 2025-12-04T10:01:25.3131940Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3132021Z Autotune Choices Stats: 2025-12-04T10:01:25.3133662Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.3133951Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3134233Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3134588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3135883Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3137161Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3138433Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3139708Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3140993Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3142308Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3142660Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.3142731Z Autotune Choices Stats: 2025-12-04T10:01:25.3144397Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.3144922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3145292Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3145931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3147322Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3148655Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3149980Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3151311Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3152738Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3154092Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3155698Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3157044Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3158378Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3159698Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3159982Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.3160121Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3160199Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3160262Z unimplemented [] 2025-12-04T10:01:25.3160368Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3160669Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3162065Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3162229Z graph_break [] 2025-12-04T10:01:25.3162411Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3162482Z Autotune Choices Stats: 2025-12-04T10:01:25.3164144Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, 
"best_triton_pos": 0} 2025-12-04T10:01:25.3164441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3164689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3165043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3166347Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3167638Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3168916Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3170196Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3171515Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3172868Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3173149Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.3173218Z Autotune Choices Stats: 2025-12-04T10:01:25.3174895Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.3175417Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3175788Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3176438Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3177782Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3179112Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3180439Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3181834Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3183222Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3184561Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3185885Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3187299Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3188633Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3189956Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3190241Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.3190432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3190541Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3190615Z unimplemented [] 2025-12-04T10:01:25.3190723Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3190937Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3192361Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3192434Z graph_break [] 2025-12-04T10:01:25.3192571Z ----------------------------- Captured stderr call 
----------------------------- 2025-12-04T10:01:25.3192638Z Autotune Choices Stats: 2025-12-04T10:01:25.3194294Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3194583Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3194837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3195196Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3196510Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3197780Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3199064Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3200336Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3201714Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3203022Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3203308Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.3203380Z Autotune Choices Stats: 2025-12-04T10:01:25.3205030Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3205554Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3205927Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3206566Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, 
torch.int32 2025-12-04T10:01:25.3207903Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3209221Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3210581Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3211970Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3213323Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3214642Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3215987Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3217312Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3218639Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3219962Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3220316Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.3220453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3220522Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3220600Z unimplemented [] 2025-12-04T10:01:25.3220708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3220952Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3222379Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3222448Z graph_break [] 2025-12-04T10:01:25.3222586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3222655Z Autotune Choices Stats: 2025-12-04T10:01:25.3224262Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3224551Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3224796Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3225159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3226468Z triton_flex_attention_745 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3227798Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3229091Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3230445Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.3231811Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3233088Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3233375Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.3233443Z Autotune Choices Stats: 2025-12-04T10:01:25.3235093Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.3235617Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3235984Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3236630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3237979Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3239300Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3240740Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3242087Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3243405Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3244745Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.3246078Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3247404Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3248723Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3250084Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3250452Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.3250590Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3250660Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3250736Z unimplemented [] 2025-12-04T10:01:25.3250842Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3251055Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3252482Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3252547Z graph_break [] 2025-12-04T10:01:25.3252686Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3252755Z Autotune Choices Stats: 2025-12-04T10:01:25.3254355Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.3254643Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3254890Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3255464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3256891Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3258178Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3259541Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3260929Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3262271Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3263551Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3263839Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.3263909Z Autotune Choices Stats: 
2025-12-04T10:01:25.3265574Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3266096Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3266475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3267113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3268538Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.3269994Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3271347Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3272676Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3273986Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3275316Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3276639Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3277973Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3279327Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3280704Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3280994Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.3281163Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3281234Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3281314Z unimplemented [] 2025-12-04T10:01:25.3281422Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3281624Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3283028Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3283091Z graph_break [] 2025-12-04T10:01:25.3283234Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3283302Z Autotune Choices Stats: 2025-12-04T10:01:25.3284907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3285190Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3285440Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3285796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3287086Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3288372Z triton_flex_attention_784 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3289774Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3291095Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3292382Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.3293661Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3293949Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.3294017Z Autotune Choices Stats: 2025-12-04T10:01:25.3295671Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3296193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3296561Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3297196Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3298574Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3299965Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3301320Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3302641Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3303973Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3305308Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3306639Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.3308035Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3309497Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3310846Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3311135Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.3311268Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3311337Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3311408Z 
unimplemented [] 2025-12-04T10:01:25.3311519Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3311721Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3313128Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3313193Z graph_break [] 2025-12-04T10:01:25.3313330Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3313403Z Autotune Choices Stats: 2025-12-04T10:01:25.3315008Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3315292Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3315534Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3315886Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3317193Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3318632Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3319959Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3321237Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3322522Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3323801Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3324089Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.3324159Z Autotune Choices Stats: 2025-12-04T10:01:25.3325818Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3326351Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3326716Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3327420Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3328797Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3330159Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.3331482Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3332813Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3334137Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3335462Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3336787Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3338226Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3339595Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3340923Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3341210Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.3341344Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3341415Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3341482Z unimplemented [] 2025-12-04T10:01:25.3341589Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3341791Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3343200Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3343262Z graph_break [] 2025-12-04T10:01:25.3343402Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3343471Z Autotune Choices Stats: 2025-12-04T10:01:25.3345070Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3345357Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3345600Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3345991Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3347371Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3348724Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3350036Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3351318Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3352615Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3353896Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3354186Z 
SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.3354256Z Autotune Choices Stats: 2025-12-04T10:01:25.3356193Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3356791Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3357222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3357908Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3359307Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3360630Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3361963Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3363285Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3364613Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3365939Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3367302Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3368692Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.3370045Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3371373Z triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3371663Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.3371801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3371875Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3371947Z unimplemented [] 2025-12-04T10:01:25.3372055Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3372261Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3373657Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3373722Z graph_break [] 2025-12-04T10:01:25.3373862Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3373930Z Autotune Choices Stats: 2025-12-04T10:01:25.3375541Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.3375870Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3376162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3376526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3377865Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3379187Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3380471Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3381749Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3383036Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3384314Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3384600Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:25.3384668Z Autotune Choices Stats: 2025-12-04T10:01:25.3386316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3386899Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3387389Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3388064Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3389411Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3390744Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3392079Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3393409Z 
triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3394745Z triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3396065Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3397487Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3398848Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3400174Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3401501Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3401784Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:25.3401986Z ___________ 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.3402067Z Traceback (most recent call last): 2025-12-04T10:01:25.3402430Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.3402498Z self.assertTrue( 2025-12-04T10:01:25.3402729Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.3402822Z raise self.failureException(msg) 2025-12-04T10:01:25.3403099Z AssertionError: False is not true : Log file /tmp/tmp7g4b1x1n/flex_attention_configs.json was not created 2025-12-04T10:01:25.3403103Z 2025-12-04T10:01:25.3403257Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:25.3403550Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:25.3403554Z 2025-12-04T10:01:25.3403734Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.3403884Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3403960Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3404024Z unimplemented [] 2025-12-04T10:01:25.3404176Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3405582Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.3405828Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:25.3405889Z graph_break [] 2025-12-04T10:01:25.3406064Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3407263Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.3407350Z current_size = base.storage().size() 2025-12-04T10:01:25.3407424Z Autotune Choices Stats: 2025-12-04T10:01:25.3409037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.3409333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3409579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3409941Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3411232Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3412513Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3413788Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3415099Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3416468Z triton_flex_attention_3 0.0154 ms 61.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3417779Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3418071Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.3418140Z Autotune Choices Stats: 2025-12-04T10:01:25.3419788Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.3420312Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3420672Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3421317Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3422647Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3423974Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3425371Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3426750Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3428128Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3429466Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3430787Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3432109Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3433428Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3434754Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3435105Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.3435249Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3435352Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3435417Z unimplemented [] 2025-12-04T10:01:25.3435529Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3435731Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3437179Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3437245Z graph_break [] 2025-12-04T10:01:25.3437378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3437450Z Autotune Choices Stats: 2025-12-04T10:01:25.3439054Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:25.3439348Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3439598Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3439963Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3441242Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3442524Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3443812Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3445199Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3446503Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3447786Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3448063Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.3448139Z Autotune Choices Stats: 2025-12-04T10:01:25.3449781Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3450321Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3450681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3451326Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3452657Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3454040Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3455738Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3457139Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3458467Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.3459787Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3461115Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3462438Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3463764Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3465567Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3465864Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.3466011Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3466087Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3466158Z unimplemented [] 2025-12-04T10:01:25.3466274Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3466561Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3468048Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3468118Z graph_break [] 2025-12-04T10:01:25.3468255Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.3468332Z Autotune Choices Stats: 2025-12-04T10:01:25.3469935Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3470231Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3470474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3470840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3472133Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3473420Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3474779Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3476091Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3477403Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3478691Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3478972Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:25.3479045Z Autotune Choices Stats: 2025-12-04T10:01:25.3480681Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3481205Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3481569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3482212Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3483543Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3484967Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3486343Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3487672Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3488991Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3490305Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3491628Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3492942Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3494293Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3495682Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3495999Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.3496142Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3496212Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3496276Z unimplemented [] 2025-12-04T10:01:25.3496390Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3496601Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:25.3497991Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3498063Z graph_break [] 2025-12-04T10:01:25.3498197Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3498270Z Autotune Choices Stats: 2025-12-04T10:01:25.3499861Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3500153Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3500395Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3500750Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3502037Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3503353Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3504690Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3506002Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3507329Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3508604Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3508888Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.3508966Z Autotune Choices Stats: 2025-12-04T10:01:25.3510600Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3511133Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3511496Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3512137Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3513584Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3514950Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3516292Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3517613Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3518938Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3520257Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3521582Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3522896Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3524339Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3525698Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3525986Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.3526125Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3526203Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3526266Z unimplemented [] 2025-12-04T10:01:25.3526377Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3526586Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3527978Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3528051Z graph_break [] 2025-12-04T10:01:25.3528187Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3528263Z Autotune Choices Stats: 2025-12-04T10:01:25.3529867Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3530162Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3530405Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3530758Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3532056Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3533454Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3534776Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3536062Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3537342Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3538623Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3538903Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.3538979Z Autotune Choices Stats: 2025-12-04T10:01:25.3540633Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3541154Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3541555Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3542227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3543584Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3544945Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3546276Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3547660Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3548991Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3550306Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3551632Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3553046Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3554406Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3556009Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3556307Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.3556446Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3556522Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3556591Z unimplemented [] 2025-12-04T10:01:25.3556706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3556914Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3558305Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3558376Z graph_break [] 2025-12-04T10:01:25.3558510Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.3558582Z Autotune Choices Stats: 2025-12-04T10:01:25.3560202Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3560493Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3560736Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3561183Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3562601Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3563928Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3565216Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3566502Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3567783Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3569060Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3569340Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.3569415Z Autotune Choices Stats: 2025-12-04T10:01:25.3571064Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3571651Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3572015Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3572707Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3574079Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3575421Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3576748Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3578082Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3579407Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3580733Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3582091Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3583482Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3584842Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3586173Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3586458Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.3586593Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3586668Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3586743Z unimplemented [] 2025-12-04T10:01:25.3586851Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3587065Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:25.3588525Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3588598Z graph_break [] 2025-12-04T10:01:25.3588732Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3588805Z Autotune Choices Stats: 2025-12-04T10:01:25.3590417Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3590784Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3591026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3591383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3592713Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3594023Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3595315Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3596617Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3597902Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3599189Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3599469Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.3599546Z Autotune Choices Stats: 2025-12-04T10:01:25.3601186Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.3601828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3602191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3602875Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3604215Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3605544Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3606865Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3608196Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3609527Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3610895Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3612283Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3613643Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3614969Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3616313Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3616594Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.3616731Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3616810Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3616874Z unimplemented [] 2025-12-04T10:01:25.3616981Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3617187Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3618577Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3618646Z graph_break [] 2025-12-04T10:01:25.3618776Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3618848Z Autotune Choices Stats: 2025-12-04T10:01:25.3620449Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.3620813Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3621133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3621491Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3622830Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3624116Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3625410Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3626697Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3628015Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3629313Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3629630Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.3629737Z Autotune Choices Stats: 2025-12-04T10:01:25.3631417Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3631942Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3632357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3633013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3634356Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3635700Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3637031Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3638363Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3639690Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3641127Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3642495Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3643824Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3645148Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3646486Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3646770Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.3646906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3646981Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3647046Z unimplemented [] 2025-12-04T10:01:25.3647149Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3647355Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3648740Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3648853Z graph_break [] 2025-12-04T10:01:25.3648985Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3649090Z Autotune Choices Stats: 2025-12-04T10:01:25.3650729Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.3651025Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3651315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3651679Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3652969Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3654237Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3655792Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3657124Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3658411Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3659770Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3660105Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.3660181Z Autotune Choices Stats: 2025-12-04T10:01:25.3661925Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3662468Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3662837Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3663490Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3664838Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3666171Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3667569Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3668904Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3670359Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3671721Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3673057Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3674398Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3675736Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3677063Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3677354Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.3677494Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3677576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3677643Z unimplemented [] 2025-12-04T10:01:25.3677752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3677968Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3679367Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3679523Z graph_break [] 2025-12-04T10:01:25.3679660Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3679729Z Autotune Choices Stats: 2025-12-04T10:01:25.3681414Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3681712Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3681965Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3682329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3683630Z 
triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3684916Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3686208Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3687504Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3688782Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3690171Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3690453Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.3690528Z Autotune Choices Stats: 2025-12-04T10:01:25.3692215Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.3692743Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3693105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3693757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3695094Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3696445Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3697770Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3699143Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3700536Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3701897Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3703232Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3704560Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3705900Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3707279Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3707571Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:25.3707708Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3707837Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3707901Z unimplemented [] 2025-12-04T10:01:25.3708009Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3708273Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3709688Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3709756Z graph_break [] 2025-12-04T10:01:25.3709888Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3709956Z Autotune Choices Stats: 2025-12-04T10:01:25.3711594Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.3711882Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3712129Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3712488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3713798Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3715081Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3716379Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3717682Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3719035Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3720411Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3720699Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.3720774Z Autotune Choices Stats: 
2025-12-04T10:01:25.3722420Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3722948Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3723308Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3723958Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3725303Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.3726652Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3727988Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3729428Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3730801Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3732134Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3733458Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3734801Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3736130Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3737457Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3737812Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.3737947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3738030Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3738102Z unimplemented [] 2025-12-04T10:01:25.3738209Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3738421Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3739853Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3739960Z graph_break [] 2025-12-04T10:01:25.3740093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3740160Z Autotune Choices Stats: 2025-12-04T10:01:25.3741784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3742075Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3742326Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3742682Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3743990Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3745285Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3746579Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3747975Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3749336Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.3750664Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3750950Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.3751024Z Autotune Choices Stats: 2025-12-04T10:01:25.3752670Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3753197Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3753563Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3754207Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3755840Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3757185Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3758693Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3760085Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3761421Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3762763Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3764092Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.3765420Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3766748Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3768117Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3768442Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.3768612Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3768688Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3768750Z 
unimplemented [] 2025-12-04T10:01:25.3768855Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3769074Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3770498Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3770572Z graph_break [] 2025-12-04T10:01:25.3770709Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3770778Z Autotune Choices Stats: 2025-12-04T10:01:25.3772385Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.3772673Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3772920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3773281Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3774584Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3775873Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3777172Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3778609Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3779954Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3781245Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3781524Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.3781606Z Autotune Choices Stats: 2025-12-04T10:01:25.3783255Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.3783770Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3784133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3784778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3786128Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3787589Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.3789000Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3790383Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3791717Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3793051Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3794376Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3795714Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3797294Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3798857Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3799147Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.3799287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3799364Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3799461Z unimplemented [] 2025-12-04T10:01:25.3799570Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3799780Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3801172Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3801240Z graph_break [] 2025-12-04T10:01:25.3801372Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3801442Z Autotune Choices Stats: 2025-12-04T10:01:25.3803076Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3803361Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3803610Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3803968Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3805272Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3806560Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3813885Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3815315Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3816638Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3817953Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3818251Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.3818324Z Autotune Choices Stats: 2025-12-04T10:01:25.3819994Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3820523Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3820902Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3821542Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3822933Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3824341Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3825700Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3827033Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3828464Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3829778Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3831106Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3832433Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.3833859Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3835212Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3835514Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.3835662Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3835742Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3835817Z unimplemented [] 2025-12-04T10:01:25.3835928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3836152Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3837548Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3837622Z graph_break [] 2025-12-04T10:01:25.3837761Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3837830Z Autotune Choices Stats: 2025-12-04T10:01:25.3839435Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3839726Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3839974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3840332Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3841637Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3842982Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3844301Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3845615Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3846917Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3848210Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3848493Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.3848561Z Autotune Choices Stats: 2025-12-04T10:01:25.3850210Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.3850738Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3851105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3851786Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3853217Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3854576Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3856172Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3857513Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3858834Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3860158Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3861476Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3862880Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3864294Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3865656Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3865954Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.3866097Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.3866182Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3866256Z unimplemented [] 2025-12-04T10:01:25.3866368Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3866580Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3868017Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3868086Z graph_break [] 2025-12-04T10:01:25.3868224Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3868294Z Autotune Choices Stats: 2025-12-04T10:01:25.3869890Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3870180Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3870425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3870782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3872130Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3873458Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3874770Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3876055Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3877342Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3878624Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3878908Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.3878981Z Autotune Choices Stats: 2025-12-04T10:01:25.3880624Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3881183Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3881586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3882227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3883639Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3885024Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3886350Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3887673Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3888998Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3890324Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3891693Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3893174Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3894536Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3895860Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3896144Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.3896294Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3896371Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3896435Z unimplemented [] 2025-12-04T10:01:25.3896546Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3896755Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3898155Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3898225Z graph_break [] 2025-12-04T10:01:25.3898362Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3898436Z Autotune Choices Stats: 2025-12-04T10:01:25.3900037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3900367Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3900612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3900996Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3902326Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3903634Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.3904926Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3906215Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3907615Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3908904Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3909187Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.3909262Z Autotune Choices Stats: 2025-12-04T10:01:25.3910905Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3911525Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3911923Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3912569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3913944Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3915287Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3916621Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3917948Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3919274Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3920594Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3922019Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3923364Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3924689Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3926013Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3926298Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.3926431Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3926518Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3926582Z unimplemented [] 2025-12-04T10:01:25.3926695Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3926900Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3928301Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3928371Z graph_break [] 2025-12-04T10:01:25.3928506Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3928579Z Autotune Choices Stats: 2025-12-04T10:01:25.3930182Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3930563Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3930803Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3931190Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3932517Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3933813Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3935096Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3936378Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3937646Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3938934Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3939214Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.3939328Z Autotune Choices Stats: 2025-12-04T10:01:25.3940967Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.3941567Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3941962Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3942612Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3943948Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3945281Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3946614Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3948006Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3949339Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3950725Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3952079Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3953432Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3954752Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3956282Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3956577Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.3956716Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3956799Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3956864Z unimplemented [] 2025-12-04T10:01:25.3956983Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3957190Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3958586Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3958657Z graph_break [] 2025-12-04T10:01:25.3958793Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3958956Z Autotune Choices Stats: 2025-12-04T10:01:25.3960576Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3960967Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3961208Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3961561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3962919Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3964205Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3965516Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3966807Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3968091Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3969384Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3969729Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.3969802Z Autotune Choices Stats: 2025-12-04T10:01:25.3971465Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.3972023Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3972386Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3973040Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3974378Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3975718Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3977042Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3978376Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3979745Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3981454Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.3982833Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3984149Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.3985469Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.3986792Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.3987079Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.3987288Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.3987374Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.3987439Z unimplemented [] 2025-12-04T10:01:25.3987548Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.3987762Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.3989163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.3989304Z graph_break [] 2025-12-04T10:01:25.3989441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.3989514Z Autotune Choices Stats: 2025-12-04T10:01:25.3991141Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.3991470Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.3991713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.3992068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.3993368Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3994648Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3995928Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.3997220Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.3998496Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.3999812Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4000177Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.4000252Z Autotune Choices Stats: 2025-12-04T10:01:25.4001907Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4002456Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4002819Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4003469Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4004803Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4006133Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4007456Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4008778Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4010200Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4011548Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4012874Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4014197Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4015521Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4016844Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4017134Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.4017274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4017352Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4017415Z unimplemented [] 2025-12-04T10:01:25.4017520Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4017769Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4019162Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4019264Z graph_break [] 2025-12-04T10:01:25.4019430Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4019508Z Autotune Choices Stats: 2025-12-04T10:01:25.4021126Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4021441Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4021690Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4022043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4023336Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4024609Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4025892Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4027195Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4028560Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4029902Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4030186Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.4030262Z Autotune Choices Stats: 2025-12-04T10:01:25.4031945Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.4032472Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4032836Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4033479Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4034817Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4036147Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4037467Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4038901Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4040257Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4041587Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4042912Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4044229Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4045558Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4046887Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4047221Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.4047356Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4047506Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4047569Z unimplemented [] 2025-12-04T10:01:25.4047672Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4047882Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4049306Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4049378Z graph_break [] 2025-12-04T10:01:25.4049509Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4049577Z Autotune Choices Stats: 2025-12-04T10:01:25.4051228Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4051521Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4051761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4052113Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4053408Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4054679Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4056108Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4057405Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4058835Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4060159Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4060444Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.4060518Z Autotune Choices Stats: 2025-12-04T10:01:25.4062169Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.4062697Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4063061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4063706Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4065043Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4066375Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4067817Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4069208Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4070550Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4071869Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4073192Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4074512Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4075841Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4077167Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4077547Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.4077683Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4077759Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4077823Z unimplemented [] 2025-12-04T10:01:25.4077960Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4078179Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4079594Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4079668Z graph_break [] 2025-12-04T10:01:25.4079800Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4079868Z Autotune Choices Stats: 2025-12-04T10:01:25.4081476Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4081770Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4082014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4082366Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4083663Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4084942Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4086234Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4087620Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4088929Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4090213Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4090494Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.4090571Z Autotune Choices Stats: 2025-12-04T10:01:25.4092217Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.4092758Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4093122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4093773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4095113Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4096462Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4097880Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4099242Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4100564Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4101897Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4103234Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4104553Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4105875Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4107315Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4107644Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.4107784Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4107860Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4107923Z unimplemented [] 2025-12-04T10:01:25.4108027Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4108282Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4109680Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4109750Z graph_break [] 2025-12-04T10:01:25.4109886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4109953Z Autotune Choices Stats: 2025-12-04T10:01:25.4111554Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4111842Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4112086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4112437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4113736Z 
triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4115006Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4116328Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4117663Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4118991Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4120278Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4120559Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.4120632Z Autotune Choices Stats: 2025-12-04T10:01:25.4122267Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4122800Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4123162Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4123800Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4125137Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4126560Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4127915Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4129234Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4130561Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4131885Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.4133215Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4134542Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4135900Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4137294Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4137619Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.4137758Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4137834Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4137897Z unimplemented [] 2025-12-04T10:01:25.4138002Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4138209Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4139600Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4139670Z graph_break [] 2025-12-04T10:01:25.4139805Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4139873Z Autotune Choices Stats: 2025-12-04T10:01:25.4141482Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4141765Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4142014Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4142364Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4143671Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4144987Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4146353Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4147702Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4149002Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4150289Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4150571Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.4150640Z Autotune Choices Stats: 
2025-12-04T10:01:25.4152297Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4152827Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4153191Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4153833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4155412Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.4156933Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4158317Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4159648Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4160978Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4162299Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4163632Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4164952Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4166393Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4167750Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4168048Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.4168190Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4168274Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4168351Z unimplemented [] 2025-12-04T10:01:25.4168464Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4168682Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4170077Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4170148Z graph_break [] 2025-12-04T10:01:25.4170285Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4170357Z Autotune Choices Stats: 2025-12-04T10:01:25.4171976Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4172271Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4172524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4172889Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4174195Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4175590Z triton_flex_attention_480 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4176918Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4178199Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4179479Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.4180770Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4181052Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.4181122Z Autotune Choices Stats: 2025-12-04T10:01:25.4182770Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.4183289Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4183703Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4184401Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4185776Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4187146Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4188571Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4189918Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4191244Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4192571Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4193897Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.4195331Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4196710Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4198030Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4198320Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.4198470Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4198547Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4198612Z 
unimplemented [] 2025-12-04T10:01:25.4198720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4198932Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4200324Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4200390Z graph_break [] 2025-12-04T10:01:25.4200527Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4200596Z Autotune Choices Stats: 2025-12-04T10:01:25.4202210Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4202496Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4202744Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4203222Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4204592Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4205891Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4207210Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4208500Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4209789Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4211074Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4211356Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.4211430Z Autotune Choices Stats: 2025-12-04T10:01:25.4213078Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4213642Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4214049Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4214733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4216109Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4217453Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.4218783Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4220121Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4221448Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4222796Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4224177Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4225568Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4226945Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4228372Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4228664Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.4228801Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4228878Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4228941Z unimplemented [] 2025-12-04T10:01:25.4229050Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4229267Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4230655Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4230728Z graph_break [] 2025-12-04T10:01:25.4230864Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4230932Z Autotune Choices Stats: 2025-12-04T10:01:25.4232545Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4232880Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4233195Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4233553Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4234888Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4236207Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4237501Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4238778Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4240073Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4241357Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4241636Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.4241710Z Autotune Choices Stats: 2025-12-04T10:01:25.4243353Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4243988Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4244354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4245029Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4246379Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4247717Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4249039Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4250378Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4251701Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4253026Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4254456Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4256135Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.4257498Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4258833Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4259121Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.4259262Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4259337Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4259402Z unimplemented [] 2025-12-04T10:01:25.4259511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4259721Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4261114Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4261180Z graph_break [] 2025-12-04T10:01:25.4261316Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4261385Z Autotune Choices Stats: 2025-12-04T10:01:25.4262998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4263397Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4263692Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4264050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4265389Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4266673Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4268015Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4269295Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4270589Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4271882Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4272212Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.4272316Z Autotune Choices Stats: 2025-12-04T10:01:25.4274004Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.4274531Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4274931Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4275579Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4276929Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4278258Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4279585Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4280919Z 
triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4282244Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4283687Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4285047Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4286375Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4287704Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4289027Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4289318Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.4289457Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.4289535Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4289599Z unimplemented [] 2025-12-04T10:01:25.4289706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4289924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4291325Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4291429Z graph_break [] 2025-12-04T10:01:25.4291564Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4291680Z Autotune Choices Stats: 2025-12-04T10:01:25.4293330Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4293616Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4293866Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4294263Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4295573Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4296867Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4298161Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4299442Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4300744Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4302028Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4302375Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.4302444Z Autotune Choices Stats: 2025-12-04T10:01:25.4304160Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4304685Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4305053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4305694Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4307044Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4308445Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4309783Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4311125Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4312554Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4313951Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4315283Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4316612Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4320033Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4321361Z 
triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4321672Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.4321825Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4321898Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4321963Z unimplemented [] 2025-12-04T10:01:25.4322079Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4322287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4323689Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4323826Z graph_break [] 2025-12-04T10:01:25.4323967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4324039Z Autotune Choices Stats: 2025-12-04T10:01:25.4325720Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4326026Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4326271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4326636Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4327948Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4329237Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.4330605Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4331879Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4333154Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4334529Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4334822Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.4334893Z Autotune Choices Stats: 2025-12-04T10:01:25.4336582Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.4337115Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4337477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4338132Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4339514Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4340843Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4342172Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4343529Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4344886Z 
triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4346332Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4348006Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4349404Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4350767Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4352090Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4352375Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.4352522Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4352628Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4352691Z unimplemented [] 2025-12-04T10:01:25.4352802Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4353027Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4354472Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4354541Z graph_break [] 2025-12-04T10:01:25.4354676Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4354749Z Autotune Choices Stats: 2025-12-04T10:01:25.4356962Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4357326Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4357620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4358031Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4359325Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4360673Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4361949Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4363229Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4364571Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4365890Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4366215Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.4366288Z Autotune Choices Stats: 2025-12-04T10:01:25.4367948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4368475Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4368842Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4369547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4370890Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4372219Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4373553Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4374946Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4376310Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4377632Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4378952Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4380312Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4381631Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4382967Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4383290Z SingleProcess AUTOTUNE 
benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.4383432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4383506Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4383569Z unimplemented [] 2025-12-04T10:01:25.4383681Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4383888Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4385316Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4385417Z graph_break [] 2025-12-04T10:01:25.4385556Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4385625Z Autotune Choices Stats: 2025-12-04T10:01:25.4387288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4387598Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4387839Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4388242Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4389527Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4390808Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4392083Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4393408Z 
triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4394730Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4396055Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4396339Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:25.4396408Z Autotune Choices Stats: 2025-12-04T10:01:25.4398050Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4398610Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4398973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4399609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4400942Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4402267Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4403650Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4405011Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4406335Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4407657Z triton_flex_attention_backward_620 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4409016Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4410338Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4411659Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4413020Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4413304Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.4413479Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4413549Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4413612Z unimplemented [] 2025-12-04T10:01:25.4413720Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4413924Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4415360Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4415426Z graph_break [] 2025-12-04T10:01:25.4415566Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4415635Z Autotune Choices Stats: 2025-12-04T10:01:25.4417228Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4417568Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4417802Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4418161Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4419445Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4420725Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4421997Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4423339Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4424663Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4425944Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4426228Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.4426302Z Autotune Choices Stats: 2025-12-04T10:01:25.4428021Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4428594Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4428961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4429601Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4430932Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4432304Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4433660Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4435015Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4436345Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4437661Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4439021Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4440352Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4441668Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4443074Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4443352Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.4443499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4443568Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4443665Z unimplemented [] 2025-12-04T10:01:25.4443776Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4443983Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4445387Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4445454Z graph_break [] 2025-12-04T10:01:25.4445588Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4445663Z Autotune Choices Stats: 2025-12-04T10:01:25.4447267Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4447599Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4447833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4448194Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4449483Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4450761Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4452111Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4453427Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4454716Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4456167Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4456544Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.4456613Z Autotune Choices Stats: 2025-12-04T10:01:25.4458263Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4458792Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4459152Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4459796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4461135Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4462579Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4463953Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4465274Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4466595Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4467995Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4469327Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4470649Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4472035Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4473394Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4473680Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.4473822Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4473892Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4473956Z unimplemented [] 2025-12-04T10:01:25.4474071Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4474278Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4475677Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4475784Z graph_break [] 2025-12-04T10:01:25.4475916Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4475988Z Autotune Choices Stats: 2025-12-04T10:01:25.4477576Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.4477867Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4478105Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4478466Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4479752Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4481098Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4482400Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4483721Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4484997Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4486278Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4486609Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.4486677Z Autotune Choices Stats: 2025-12-04T10:01:25.4488326Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4488863Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4489225Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4489905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4491590Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4492955Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4494286Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4495602Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4496960Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.4498284Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4499615Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4500990Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4502341Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4503689Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4503980Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.4504127Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4504200Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4504265Z unimplemented [] 2025-12-04T10:01:25.4504381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4504582Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4505978Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4506078Z graph_break [] 2025-12-04T10:01:25.4506223Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.4506306Z Autotune Choices Stats: 2025-12-04T10:01:25.4507938Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.4508235Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4508479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4508843Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4510174Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4511531Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4512838Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4514129Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4515401Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4516750Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4517039Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.4517108Z Autotune Choices Stats: 2025-12-04T10:01:25.4518741Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4519269Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4519667Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4520312Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.4521699Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4523028Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4524355Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4525670Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4527032Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4528349Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4529678Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4531056Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4532410Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4533732Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4534014Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.4534155Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4534223Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4534328Z unimplemented [] 2025-12-04T10:01:25.4534439Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4534643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4536043Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4536106Z graph_break [] 2025-12-04T10:01:25.4536239Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4536311Z Autotune Choices Stats: 2025-12-04T10:01:25.4537921Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4538210Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4538488Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4538840Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4540155Z triton_flex_attention_707 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4541469Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4542754Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4544047Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.4545355Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4546639Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4546916Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.4546987Z Autotune Choices Stats: 2025-12-04T10:01:25.4548698Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4549262Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4549653Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4550300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4551689Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4553023Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4554352Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4556000Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4557332Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4558645Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.4560092Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4561451Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4562764Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4564087Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4564426Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.4564569Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4564638Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4564707Z unimplemented [] 2025-12-04T10:01:25.4564815Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4565015Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4566425Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4566488Z graph_break [] 2025-12-04T10:01:25.4566621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4566695Z Autotune Choices Stats: 2025-12-04T10:01:25.4568313Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4568644Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4568895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4569291Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4570610Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4571893Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4573164Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4574483Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4575760Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4577054Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4577333Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.4577441Z Autotune Choices Stats: 
2025-12-04T10:01:25.4579080Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4579634Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4579996Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4580685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4582028Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.4583363Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4584716Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4586029Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4587433Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4588792Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4590167Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4591511Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4592840Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4594160Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4594474Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.4594615Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4594687Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4594751Z unimplemented [] 2025-12-04T10:01:25.4594861Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4595063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4596468Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4596546Z graph_break [] 2025-12-04T10:01:25.4596681Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4596790Z Autotune Choices Stats: 2025-12-04T10:01:25.4598388Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4598714Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4598952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4599311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4600620Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4601902Z triton_flex_attention_746 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4603181Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4604492Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4605767Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.4607060Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4607377Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.4607450Z Autotune Choices Stats: 2025-12-04T10:01:25.4609116Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.4609673Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4610037Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4610678Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4612017Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4613371Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4614684Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4616011Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4617363Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4618709Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4620069Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.4621388Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4622711Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4624211Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4624491Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.4624630Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4624700Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4624762Z 
unimplemented [] 2025-12-04T10:01:25.4624873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4625076Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4626473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4626602Z graph_break [] 2025-12-04T10:01:25.4626736Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4626810Z Autotune Choices Stats: 2025-12-04T10:01:25.4628475Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.4628768Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4629053Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4629416Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4630696Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4631979Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4633292Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4634575Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4635852Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4637165Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4637473Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.4637552Z Autotune Choices Stats: 2025-12-04T10:01:25.4639220Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4639763Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4640131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4640782Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4642132Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4643520Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.4644848Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4646172Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4647565Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4648912Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4650242Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4651568Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4652937Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4654265Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4654547Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.4654691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4654762Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4654824Z unimplemented [] 2025-12-04T10:01:25.4654938Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4655178Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4656891Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4656963Z graph_break [] 2025-12-04T10:01:25.4657178Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4657245Z Autotune Choices Stats: 2025-12-04T10:01:25.4658887Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4659183Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4659423Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4659783Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4661073Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4662429Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4663701Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4664999Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4666382Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4667784Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4668072Z 
SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.4668139Z Autotune Choices Stats: 2025-12-04T10:01:25.4669810Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4670335Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4670697Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4671373Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4672702Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4674027Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4675356Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4676701Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4678105Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4679431Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4680761Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4682125Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.4683439Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4684775Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4685055Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.4685240Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4685310Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4685373Z unimplemented [] 2025-12-04T10:01:25.4685482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4685688Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4687127Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4687190Z graph_break [] 2025-12-04T10:01:25.4687321Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4687395Z Autotune Choices Stats: 2025-12-04T10:01:25.4689037Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4689333Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4689572Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4689928Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4691212Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4692532Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4693818Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4695101Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4696464Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4697785Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4698075Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.4698142Z Autotune Choices Stats: 2025-12-04T10:01:25.4699783Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4700310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4700704Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4701349Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4702681Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4704014Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4705379Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4706741Z 
triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4708150Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4709468Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4710796Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4712153Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4713478Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4714809Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4715124Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.4715268Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.4715337Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4715409Z unimplemented [] 2025-12-04T10:01:25.4715527Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4715767Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4717207Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4717274Z graph_break [] 2025-12-04T10:01:25.4717408Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4717482Z Autotune Choices Stats: 2025-12-04T10:01:25.4719082Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4719379Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4719654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4720016Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4721313Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4722590Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4723878Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4725202Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4726543Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4727829Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4728113Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.4728181Z Autotune Choices Stats: 2025-12-04T10:01:25.4729818Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4730386Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4730743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4731383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4732716Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4734047Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4735447Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4736804Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4738140Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4739460Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4740813Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4742144Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4743465Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4744825Z 
triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4745140Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.4745278Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4745356Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4745420Z unimplemented [] 2025-12-04T10:01:25.4745528Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4745763Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4747154Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4747270Z graph_break [] 2025-12-04T10:01:25.4747436Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4747510Z Autotune Choices Stats: 2025-12-04T10:01:25.4749112Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.4749442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4749680Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4750035Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4751327Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4752600Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.4753911Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4755462Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4756867Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4758165Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4758452Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:25.4758520Z Autotune Choices Stats: 2025-12-04T10:01:25.4760216Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4760747Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4761115Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4761762Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4763114Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4764539Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4765904Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4767230Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4768554Z 
triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4769904Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4771228Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4772559Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4773935Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4775290Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4775572Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:25.4775746Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4775818Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4775883Z unimplemented [] 2025-12-04T10:01:25.4776000Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4776206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4777615Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4777681Z graph_break [] 2025-12-04T10:01:25.4777816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4777889Z Autotune Choices Stats: 2025-12-04T10:01:25.4779532Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4779826Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4780068Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4780431Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4781728Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4783012Z triton_flex_attention_860 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4784370Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4785691Z triton_flex_attention_858 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4787198Z triton_flex_attention_855 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4788703Z triton_flex_attention_856 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4789023Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.4789097Z Autotune Choices Stats: 2025-12-04T10:01:25.4790738Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.4791264Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4791627Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4792271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4793646Z triton_flex_attention_backward_862 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4795015Z triton_flex_attention_backward_863 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4796501Z triton_flex_attention_backward_864 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4798011Z triton_flex_attention_backward_861 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4799346Z triton_flex_attention_backward_865 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4800714Z triton_flex_attention_backward_866 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4802054Z triton_flex_attention_backward_868 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4803384Z triton_flex_attention_backward_867 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4804783Z triton_flex_attention_backward_870 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4806142Z triton_flex_attention_backward_869 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4806427Z SingleProcess AUTOTUNE 
benchmarking takes 0.6670 seconds and 2.3594 seconds precompiling for 13 choices
2025-12-04T10:01:25.4806628Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________
2025-12-04T10:01:25.4806708Z Traceback (most recent call last):
2025-12-04T10:01:25.4807065Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging
2025-12-04T10:01:25.4807137Z self.assertTrue(
2025-12-04T10:01:25.4807364Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
2025-12-04T10:01:25.4807452Z raise self.failureException(msg)
2025-12-04T10:01:25.4807729Z AssertionError: False is not true : Log file /tmp/tmpxbv7srfc/flex_attention_configs.json was not created
2025-12-04T10:01:25.4807734Z
2025-12-04T10:01:25.4807878Z To execute this test, run the following from the base repo dir:
2025-12-04T10:01:25.4808235Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda
2025-12-04T10:01:25.4808239Z
2025-12-04T10:01:25.4808420Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:01:25.4808564Z ----------------------------- Captured stdout call -----------------------------
2025-12-04T10:01:25.4808634Z frames [('total', 2), ('ok', 2)]
2025-12-04T10:01:25.4808698Z unimplemented []
2025-12-04T10:01:25.4808811Z stats [('calls_captured', 105), ('unique_graphs', 2)]
2025-12-04T10:01:25.4810223Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)]
2025-12-04T10:01:25.4810437Z
aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4810497Z graph_break [] 2025-12-04T10:01:25.4810630Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4811811Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.4811933Z current_size = base.storage().size() 2025-12-04T10:01:25.4812009Z Autotune Choices Stats: 2025-12-04T10:01:25.4813638Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.4813933Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4814175Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4814573Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4815894Z triton_flex_attention_4 0.0094 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4817184Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4818545Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4819852Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.4821140Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4822426Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4822754Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.4822830Z Autotune Choices Stats: 2025-12-04T10:01:25.4824558Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.4825087Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4825456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4826106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4827550Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4828941Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4830283Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4831629Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4833008Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4834376Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, 
num_warps=4 2025-12-04T10:01:25.4835777Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4837120Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4838465Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4839847Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4840134Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.4840275Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4840355Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4840420Z unimplemented [] 2025-12-04T10:01:25.4840531Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4840743Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4842161Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4842269Z graph_break [] 2025-12-04T10:01:25.4842405Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4842480Z Autotune Choices Stats: 2025-12-04T10:01:25.4844150Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4844501Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4844743Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4845106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4846426Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4847732Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4849074Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4850386Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4851677Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4853040Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4853327Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.4853402Z Autotune Choices Stats: 2025-12-04T10:01:25.4855102Z 
{"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4856055Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4856645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4857677Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4859250Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4860614Z 
triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4861956Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4863301Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4864769Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4866155Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4867576Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4868925Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4870313Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4871657Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4871947Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.4872087Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4872165Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4872273Z unimplemented [] 2025-12-04T10:01:25.4872383Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4872597Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4874042Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4874113Z 
graph_break [] 2025-12-04T10:01:25.4874250Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4874323Z Autotune Choices Stats: 2025-12-04T10:01:25.4875980Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4876290Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4876529Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4876888Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4878206Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4879541Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4880854Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4882153Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4883476Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4884826Z triton_flex_attention_39 0.0133 
ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4885146Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:25.4885222Z Autotune Choices Stats: 2025-12-04T10:01:25.4886887Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4887418Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4887780Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4888475Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4889834Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4891183Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4892527Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4893944Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4895307Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4896656Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4897996Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4899363Z triton_flex_attention_backward_51 0.0133 ms 
84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4900710Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4902054Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4902379Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.4902517Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4902594Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4902660Z unimplemented [] 2025-12-04T10:01:25.4902768Z stats [('calls_captured', 105), ('unique_graphs', 2)] 
2025-12-04T10:01:25.4902981Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4904438Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4904511Z graph_break [] 2025-12-04T10:01:25.4904677Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4904746Z Autotune Choices Stats: 2025-12-04T10:01:25.4906368Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4906659Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4906899Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4907306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.4908657Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4909958Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4911263Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4912570Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4913948Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4920397Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4920737Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.4920823Z Autotune Choices Stats: 2025-12-04T10:01:25.4922482Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.4923061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4923433Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4924076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4925421Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4926801Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4928202Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4929584Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4930951Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4932301Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4933640Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4935001Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4936344Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4937676Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4938011Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.4938159Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4938243Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4938352Z unimplemented [] 2025-12-04T10:01:25.4938471Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4938688Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4940154Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4940229Z graph_break [] 2025-12-04T10:01:25.4940373Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4940444Z Autotune Choices Stats: 2025-12-04T10:01:25.4942067Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4942363Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4942654Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4943017Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4944333Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4945621Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4946928Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4948378Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4949849Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4951144Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4951433Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.4951511Z Autotune Choices Stats: 
2025-12-04T10:01:25.4953157Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4953722Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4954091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4954737Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4956357Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.4957782Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4959163Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4960541Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4961864Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4963194Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4964626Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4965951Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4967278Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.4968668Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4968966Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.4969109Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4969194Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4969260Z unimplemented [] 2025-12-04T10:01:25.4969370Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4969615Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.4971036Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.4971104Z graph_break [] 2025-12-04T10:01:25.4971242Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.4971314Z Autotune Choices Stats: 2025-12-04T10:01:25.4972929Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.4973268Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4973517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4973877Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4975173Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4976462Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4977781Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.4979091Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.4980447Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.4981734Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4982024Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.4982128Z Autotune Choices Stats: 2025-12-04T10:01:25.4983785Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.4984315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.4984677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.4985322Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, 
torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.4986676Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4988189Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4989557Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4990899Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4992223Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4993589Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.4994918Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.4996256Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.4997624Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.4998986Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.4999344Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.4999484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.4999565Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.4999634Z unimplemented [] 
2025-12-04T10:01:25.4999743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.4999956Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5001365Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5001437Z graph_break [] 2025-12-04T10:01:25.5001573Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5001682Z Autotune Choices Stats: 2025-12-04T10:01:25.5003288Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5003574Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5003827Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5004183Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5005482Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5006819Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5008403Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5009726Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5011018Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5012310Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5012630Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.5012701Z Autotune Choices Stats: 2025-12-04T10:01:25.5014369Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5014893Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5015271Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5015919Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5017314Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5018712Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5020042Z 
triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5021374Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5022748Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5024076Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5025406Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5026725Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5028176Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5029543Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5029836Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.5029975Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5030051Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5030116Z unimplemented [] 2025-12-04T10:01:25.5030222Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5030431Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5031840Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5031942Z graph_break [] 2025-12-04T10:01:25.5032078Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5032146Z Autotune Choices Stats: 2025-12-04T10:01:25.5033757Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.5034051Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5034297Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5034652Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5035948Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5037311Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5038637Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5039924Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5041208Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5042529Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5042812Z SingleProcess AUTOTUNE 
benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.5042881Z Autotune Choices Stats: 2025-12-04T10:01:25.5044545Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5045066Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5045467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5046116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5047486Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5048857Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5050184Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5051525Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5052884Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5054217Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5055852Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5057342Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5058721Z 
triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5060050Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5060343Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.5060484Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5060561Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5060627Z unimplemented [] 2025-12-04T10:01:25.5060738Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5061007Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5062423Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5062493Z graph_break [] 2025-12-04T10:01:25.5062630Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5062698Z Autotune Choices Stats: 2025-12-04T10:01:25.5064314Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.5064607Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5064850Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5065246Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5066590Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, 
num_warps=4 2025-12-04T10:01:25.5067969Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5069268Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5070555Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5071879Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5073168Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5073454Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.5073527Z Autotune Choices Stats: 2025-12-04T10:01:25.5075186Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5075760Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5076142Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1] 2025-12-04T10:01:25.5076811Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5078189Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5079507Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5080845Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5082215Z triton_flex_attention_backward_161 0.0123 ms 91.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5083541Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5084868Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5086229Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5087591Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5088950Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5090273Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5090565Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.5090735Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:25.5090805Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5090874Z unimplemented [] 2025-12-04T10:01:25.5090980Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5091191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5092600Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5092670Z graph_break [] 2025-12-04T10:01:25.5092804Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5092873Z Autotune Choices Stats: 2025-12-04T10:01:25.5094483Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5094820Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5095069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5095423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5096765Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5098073Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5099364Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5100648Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5101975Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5103268Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5103551Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.5103619Z Autotune Choices Stats: 2025-12-04T10:01:25.5105273Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5105859Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5106227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5106905Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5108301Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5109637Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5111050Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5112427Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5113771Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5115146Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5116527Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5117888Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5119214Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5120543Z triton_flex_attention_backward_189 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5120871Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:25.5121012Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5121091Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5121160Z unimplemented [] 2025-12-04T10:01:25.5121276Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5121484Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5122907Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5122980Z graph_break [] 2025-12-04T10:01:25.5123117Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5123196Z Autotune Choices Stats: 2025-12-04T10:01:25.5124811Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.5125145Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5125419Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5125777Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5127115Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5128404Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5129689Z triton_flex_attention_193 
0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5131021Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5132303Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5133597Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5133926Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.5134006Z Autotune Choices Stats: 2025-12-04T10:01:25.5135726Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5136255Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5136650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5137302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5138644Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5139980Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5141340Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5142668Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5144001Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5145389Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5146778Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5148152Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5149482Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5150845Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5151130Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.5151266Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5151346Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5151413Z unimplemented [] 2025-12-04T10:01:25.5151527Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5151735Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5153142Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5153252Z graph_break [] 2025-12-04T10:01:25.5153389Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5153462Z Autotune Choices Stats: 2025-12-04T10:01:25.5155116Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5155678Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5156001Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5156364Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5157677Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5158962Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5160302Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5161593Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5162875Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5164220Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5164505Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.5164580Z Autotune Choices Stats: 2025-12-04T10:01:25.5166316Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5166856Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5167218Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5167870Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5169222Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5170604Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5171936Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5173274Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5174680Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5176038Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5177375Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5178703Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5180066Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5181398Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5181691Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds 
precompiling for 13 choices 2025-12-04T10:01:25.5181831Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5181912Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5181982Z unimplemented [] 2025-12-04T10:01:25.5182090Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5182304Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5183748Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5183820Z graph_break [] 2025-12-04T10:01:25.5183955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5184030Z Autotune Choices Stats: 2025-12-04T10:01:25.5185723Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.5186029Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:25.5186270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5186627Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5188026Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5189333Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5190673Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5191963Z triton_flex_attention_230 0.0113 ms 81.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5193245Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5194607Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5194893Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.5194971Z Autotune Choices Stats: 2025-12-04T10:01:25.5196824Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5197462Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5197894Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5198607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5199988Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5201338Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5202685Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5204057Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5205418Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5206971Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5208399Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5209732Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5211100Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5212435Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5212736Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.5212878Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5212991Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5213058Z unimplemented [] 2025-12-04T10:01:25.5213166Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5213383Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5214817Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5214889Z graph_break [] 2025-12-04T10:01:25.5215023Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5215102Z Autotune Choices Stats: 2025-12-04T10:01:25.5216741Z {"num_choices": 6, "num_triton_choices": 6, 
"best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5217039Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5217281Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5217640Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5218945Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5220263Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5221559Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5222858Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5224218Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5225549Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5225836Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.5225911Z Autotune Choices Stats: 2025-12-04T10:01:25.5227609Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5228138Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5228499Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5229185Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5230525Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5231879Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5233207Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5234599Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5235971Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5237319Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5238652Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5240017Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5241349Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5242677Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5243003Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.5243140Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5243215Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5243278Z unimplemented [] 2025-12-04T10:01:25.5243387Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5243598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5245050Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5245119Z graph_break [] 2025-12-04T10:01:25.5245264Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5245335Z Autotune Choices Stats: 2025-12-04T10:01:25.5246945Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5247232Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5247478Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5247868Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5249177Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5250464Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5251755Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5253076Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5254387Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5256010Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5256311Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.5256390Z Autotune Choices Stats: 2025-12-04T10:01:25.5258062Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5258664Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5259029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5259682Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5261039Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5262390Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5263823Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5265192Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5266529Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5267920Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5269298Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5270631Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5271974Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5273339Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5273631Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.5273803Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5273881Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5273947Z unimplemented [] 2025-12-04T10:01:25.5274054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5274269Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5275707Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5275779Z graph_break [] 2025-12-04T10:01:25.5275915Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5275995Z Autotune Choices Stats: 2025-12-04T10:01:25.5277609Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:25.5277946Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5278193Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5278548Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5279854Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5281140Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5282457Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5283779Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5285097Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5286390Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5286673Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.5286749Z Autotune Choices Stats: 2025-12-04T10:01:25.5288405Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5288970Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5289332Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5289982Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5291334Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5292762Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5294125Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5295517Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5296846Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5298185Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5299554Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5300886Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5302220Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5303608Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5303897Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.5304038Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5304145Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5304212Z unimplemented [] 2025-12-04T10:01:25.5304319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5304530Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5305930Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5306000Z graph_break [] 2025-12-04T10:01:25.5306134Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.5306201Z Autotune Choices Stats: 2025-12-04T10:01:25.5307858Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5308189Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5308439Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5308792Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5310096Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5311383Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5312742Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5314062Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5317338Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5320292Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5322045Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.5322487Z Autotune Choices Stats: 2025-12-04T10:01:25.5324275Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5326567Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5327695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5328944Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.5330959Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5333417Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5335873Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5338293Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5340841Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5343366Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5345804Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5348276Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5350817Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5353260Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5354768Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.5355451Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5355821Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5356026Z unimplemented [] 2025-12-04T10:01:25.5356241Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5356613Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5358076Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5359482Z graph_break [] 2025-12-04T10:01:25.5359725Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5360013Z Autotune Choices Stats: 2025-12-04T10:01:25.5361588Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5363338Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5363881Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5364496Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5366040Z triton_flex_attention_327 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5368479Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5370917Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5373282Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.5375613Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5377983Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5379471Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.5379868Z Autotune Choices Stats: 2025-12-04T10:01:25.5381436Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5383429Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5384270Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5385277Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5387134Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5389705Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5392173Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5394605Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5397131Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5399540Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.5401944Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5404427Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5406866Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5409323Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5410818Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.5411287Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5411576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5411776Z unimplemented [] 2025-12-04T10:01:25.5411982Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5412354Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5413833Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5415195Z graph_break [] 2025-12-04T10:01:25.5415417Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5415697Z Autotune Choices Stats: 2025-12-04T10:01:25.5417228Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5418952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5419508Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5420116Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5421726Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5424085Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5426444Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5428876Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5431202Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5433575Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5435034Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.5435432Z Autotune Choices Stats: 
2025-12-04T10:01:25.5436993Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5438985Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5439833Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5440824Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5442690Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.5445109Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5447525Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5449964Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5452412Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5454828Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5457549Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5460105Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5462567Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5464988Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5466522Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.5466982Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5467414Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5467616Z unimplemented [] 2025-12-04T10:01:25.5467823Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5468194Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5469661Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5470978Z graph_break [] 2025-12-04T10:01:25.5471210Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5471492Z Autotune Choices Stats: 2025-12-04T10:01:25.5473017Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5474794Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5475346Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5475961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5477563Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5479944Z triton_flex_attention_366 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5482285Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5484633Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5487056Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.5489395Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5490844Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.5491235Z Autotune Choices Stats: 2025-12-04T10:01:25.5492828Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5494819Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5495695Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5496670Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5498529Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5500967Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5503379Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5505833Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5508308Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5510720Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5513194Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.5515629Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5518049Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5520497Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5522026Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.5522487Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5522769Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5522966Z 
unimplemented [] 2025-12-04T10:01:25.5523172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5523548Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5525005Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5526322Z graph_break [] 2025-12-04T10:01:25.5526547Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5526826Z Autotune Choices Stats: 2025-12-04T10:01:25.5528353Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5530126Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5530673Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5531323Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5532920Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5535257Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5537589Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5539981Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5542324Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5544671Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5546127Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.5546556Z Autotune Choices Stats: 2025-12-04T10:01:25.5548222Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5550179Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5551081Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5552061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5553897Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5556620Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.5559224Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5561640Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5564054Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5566569Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5569073Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5571497Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5573906Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5576348Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5577880Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.5578350Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5578642Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5578839Z unimplemented [] 2025-12-04T10:01:25.5579054Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5579428Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5580898Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5582215Z graph_break [] 2025-12-04T10:01:25.5582443Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5582767Z Autotune Choices Stats: 2025-12-04T10:01:25.5584334Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5586059Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5586597Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5587306Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5588895Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5591237Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5593568Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5595985Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5598326Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5600661Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5602151Z 
SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.5602544Z Autotune Choices Stats: 2025-12-04T10:01:25.5604143Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.5606167Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5607027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5608004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5609836Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5612290Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5614707Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5617122Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5619579Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5622047Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5624482Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5626909Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.5629389Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5631829Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5633323Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.5633781Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5634068Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5634279Z unimplemented [] 2025-12-04T10:01:25.5634482Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5634855Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5636354Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5637729Z graph_break [] 2025-12-04T10:01:25.5637965Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5638241Z Autotune Choices Stats: 2025-12-04T10:01:25.5639803Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5641563Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5642107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5642736Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5644274Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5646620Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5648991Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5651366Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5653705Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5656367Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5657942Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.5658335Z Autotune Choices Stats: 2025-12-04T10:01:25.5659947Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.5661924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5662762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5663739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5665564Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5668182Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5670600Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5673014Z 
triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5675520Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5677971Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5680429Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5682850Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5685299Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5687720Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5689222Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.5689695Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.5689983Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5690185Z unimplemented [] 2025-12-04T10:01:25.5690398Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5690831Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5692331Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5693660Z graph_break [] 2025-12-04T10:01:25.5694254Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5694549Z Autotune Choices Stats: 2025-12-04T10:01:25.5696113Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5697842Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5698388Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5699014Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5700562Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5702929Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5705271Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5707707Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5710080Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5712458Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5713913Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.5714341Z Autotune Choices Stats: 2025-12-04T10:01:25.5715907Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5717910Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5718761Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5719796Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5721634Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5724042Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5726444Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5728916Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5731332Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5733725Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5736115Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5738549Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5740943Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5743339Z 
triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5744871Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.5745336Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5745626Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5745819Z unimplemented [] 2025-12-04T10:01:25.5746031Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5746410Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5747973Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5749296Z graph_break [] 2025-12-04T10:01:25.5749522Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5749845Z Autotune Choices Stats: 2025-12-04T10:01:25.5751382Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5753100Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5753646Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5754268Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5756159Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5758495Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.5760830Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5763164Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5765628Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5768020Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5769476Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.5769869Z Autotune Choices Stats: 2025-12-04T10:01:25.5771432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.5773376Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5774277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5775249Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5777118Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5779584Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5782035Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5784471Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5786888Z 
triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5789409Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5791815Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5794254Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5796647Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5799033Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5800552Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.5801010Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5801300Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5801497Z unimplemented [] 2025-12-04T10:01:25.5801740Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5802104Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5803593Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5804931Z graph_break [] 2025-12-04T10:01:25.5805158Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5805433Z Autotune Choices Stats: 2025-12-04T10:01:25.5806988Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.5808702Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5809282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5809903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5811452Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5813788Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5816116Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5818507Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5820880Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5823211Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5824662Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.5825042Z Autotune Choices Stats: 2025-12-04T10:01:25.5826602Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.5828677Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5829516Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5830488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5832314Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5834726Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5837211Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5839686Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5842081Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5844525Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5846960Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5849367Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5851774Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5854248Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.5856020Z SingleProcess AUTOTUNE 
benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.5856489Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5856776Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5856966Z unimplemented [] 2025-12-04T10:01:25.5857172Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5857626Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5859130Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5860453Z graph_break [] 2025-12-04T10:01:25.5860680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5860961Z Autotune Choices Stats: 2025-12-04T10:01:25.5862486Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.5864269Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5864806Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5865424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5866978Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5869397Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5871785Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5874178Z 
triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5876542Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5878873Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5880325Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.5880759Z Autotune Choices Stats: 2025-12-04T10:01:25.5882329Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.5884268Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5885099Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5886073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5887890Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5890387Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5892810Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5895217Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5897610Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5900099Z triton_flex_attention_backward_504 0.0153 
ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5902498Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5904895Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5907416Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5909886Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5911409Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.5911885Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5912174Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5912377Z unimplemented [] 2025-12-04T10:01:25.5912580Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5912952Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5914412Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5915724Z graph_break [] 2025-12-04T10:01:25.5915950Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5916275Z Autotune Choices Stats: 2025-12-04T10:01:25.5917828Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.5919546Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5920091Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5920720Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5922263Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5924637Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5926993Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5929350Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5931677Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5934001Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5935489Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.5935879Z Autotune Choices Stats: 2025-12-04T10:01:25.5937470Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.5939416Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5940257Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5941234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5943095Z triton_flex_attention_backward_520 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5945578Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5948035Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5950436Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5952833Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5955489Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5957950Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.5960346Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.5962888Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5965327Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.5966811Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.5967320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.5967604Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.5967801Z unimplemented [] 2025-12-04T10:01:25.5968007Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.5968371Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.5975668Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.5977146Z graph_break [] 2025-12-04T10:01:25.5977399Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.5977699Z Autotune Choices Stats: 2025-12-04T10:01:25.5979243Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.5980984Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.5981541Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.5982159Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.5983697Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5986101Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5988552Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.5990873Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5993191Z triton_flex_attention_535 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.5995541Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.5996987Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.5997379Z Autotune Choices Stats: 2025-12-04T10:01:25.5998942Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.6000895Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6001781Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6002755Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6004606Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6007070Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6009475Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6011852Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6014271Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6016660Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6019041Z triton_flex_attention_backward_544 0.0155 ms 86.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6021491Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6023912Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6026297Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6027831Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.6028315Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6028600Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6028800Z unimplemented [] 2025-12-04T10:01:25.6029014Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6029389Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6030915Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6032226Z graph_break [] 2025-12-04T10:01:25.6032462Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6032746Z Autotune Choices Stats: 2025-12-04T10:01:25.6034275Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6035989Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6036528Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6037189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6038773Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6041132Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6043447Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6045754Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6048070Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6050418Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6051862Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.6052253Z Autotune Choices Stats: 2025-12-04T10:01:25.6053803Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6056119Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6056962Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6058005Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6059881Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6062295Z triton_flex_attention_backward_558 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6064695Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6067152Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6069599Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6071993Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6074461Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6076881Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6079307Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6081698Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6083178Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.6083679Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6083962Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6084161Z unimplemented [] 2025-12-04T10:01:25.6084372Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6084747Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6086202Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6087520Z graph_break [] 2025-12-04T10:01:25.6087744Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6088024Z Autotune Choices Stats: 2025-12-04T10:01:25.6089540Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6091302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6091848Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6092456Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6094026Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6095197Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6096321Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6097449Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6098613Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6099738Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6099994Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.6100063Z Autotune Choices Stats: 2025-12-04T10:01:25.6101503Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.6102005Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6102347Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6102948Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6104128Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6105294Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6106500Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6107697Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6108863Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6110054Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6111287Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6112480Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6113641Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6114809Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6115091Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.6115232Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6115304Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6115371Z unimplemented [] 2025-12-04T10:01:25.6115496Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6115687Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6116872Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6116937Z graph_break [] 2025-12-04T10:01:25.6117072Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6117141Z Autotune Choices Stats: 2025-12-04T10:01:25.6118556Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6118849Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6119101Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6119424Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6120590Z 
triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6121719Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6122842Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6124003Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6125133Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6126260Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6126544Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.6126613Z Autotune Choices Stats: 2025-12-04T10:01:25.6128102Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:25.6128540Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6128916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6129487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6130657Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6131826Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6133029Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6134197Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6135366Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6136586Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6137796Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6138956Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6140114Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6141313Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6141559Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.6141707Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6141778Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6141843Z unimplemented [] 2025-12-04T10:01:25.6141953Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6142139Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6143327Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6143425Z graph_break [] 2025-12-04T10:01:25.6143552Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6143626Z Autotune Choices Stats: 2025-12-04T10:01:25.6145068Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.6145320Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6145573Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6145894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6147049Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6148224Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6149386Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6150516Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6151643Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6152767Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6153050Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 
2025-12-04T10:01:25.6153119Z Autotune Choices Stats: 2025-12-04T10:01:25.6154612Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6155059Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6155708Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6156286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6157461Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6158708Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6159869Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6161027Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6162261Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6163510Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6164669Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6165827Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6167026Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6168193Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6168439Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.6168585Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6168659Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6168726Z unimplemented [] 2025-12-04T10:01:25.6168837Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6169024Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6170210Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6170312Z graph_break [] 2025-12-04T10:01:25.6170441Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6170516Z Autotune Choices Stats: 2025-12-04T10:01:25.6171978Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6172228Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6172449Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6172773Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6173906Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6175043Z triton_flex_attention_632 0.0113 
ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6176211Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6177334Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6178464Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6179646Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6179897Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.6179966Z Autotune Choices Stats: 2025-12-04T10:01:25.6181427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6181874Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6182201Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6182770Z dtypes: torch.float16, 
torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6183977Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6185154Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6186326Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6187601Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6188797Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6189990Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6191155Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6192317Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6193521Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6194691Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6194937Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.6195072Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6195178Z frames [('total', 2), ('ok', 2)] 
2025-12-04T10:01:25.6195240Z unimplemented [] 2025-12-04T10:01:25.6195358Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6195550Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6196777Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6196840Z graph_break [] 2025-12-04T10:01:25.6196967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6197043Z Autotune Choices Stats: 2025-12-04T10:01:25.6198481Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6198734Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6198956Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6199276Z dtypes: 
torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6200405Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6201562Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6202676Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6203810Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6204968Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6206121Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6206403Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.6206475Z Autotune Choices Stats: 2025-12-04T10:01:25.6207914Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6208356Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6208681Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6209286Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6210456Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6211621Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6212787Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6214011Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6215206Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6216367Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6217528Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6218724Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6219885Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6221060Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6221344Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.6221480Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6221555Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6221619Z unimplemented [] 2025-12-04T10:01:25.6221727Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6221913Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6223133Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6223247Z graph_break [] 2025-12-04T10:01:25.6223378Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6223453Z Autotune Choices Stats: 2025-12-04T10:01:25.6224851Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6225104Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6225324Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6225678Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6226813Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6227996Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6229120Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, 
BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6230281Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6231430Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6232592Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.6232854Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.6232924Z Autotune Choices Stats: 2025-12-04T10:01:25.6234354Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6234828Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6235163Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6235731Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6236910Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6238078Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6239304Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6240499Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6241676Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6242834Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6244035Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6245196Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.6246351Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6247580Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6247828Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.6247998Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6248073Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6248136Z unimplemented [] 2025-12-04T10:01:25.6248250Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6248444Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6249665Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), 
('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6249731Z graph_break [] 2025-12-04T10:01:25.6249863Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6249937Z Autotune Choices Stats: 2025-12-04T10:01:25.6251340Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.6251625Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6251845Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6252169Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6253305Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6254432Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6255832Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6257120Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6258294Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6259423Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6259677Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.6259744Z Autotune Choices Stats: 2025-12-04T10:01:25.6261192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6261692Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6262027Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 
1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6262598Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6263775Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6264983Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6266190Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.6267447Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6268617Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6269774Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6270970Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6272132Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6273291Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6274512Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6274759Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.6274896Z 
----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6274967Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6275072Z unimplemented [] 2025-12-04T10:01:25.6275186Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6275376Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6276571Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6276634Z graph_break [] 2025-12-04T10:01:25.6276761Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6276836Z Autotune Choices Stats: 2025-12-04T10:01:25.6278236Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6278520Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6278750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6279074Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6280204Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6281336Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6282532Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6283702Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6284823Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6285951Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6286239Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.6286308Z Autotune Choices Stats: 2025-12-04T10:01:25.6287747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6288190Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6288522Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6289096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6290328Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6291534Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6292731Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6293898Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6295066Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6296259Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6297423Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6298594Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6299819Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6301023Z 
triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6301273Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.6301412Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6301481Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6301544Z unimplemented [] 2025-12-04T10:01:25.6301656Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6301846Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6303044Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6303147Z graph_break [] 2025-12-04T10:01:25.6303280Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6303355Z Autotune Choices Stats: 2025-12-04T10:01:25.6304757Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.6305019Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6305243Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6305562Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6306715Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6307920Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.6309081Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6310243Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6311376Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6312503Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6312779Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.6312852Z Autotune Choices Stats: 2025-12-04T10:01:25.6314295Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6314736Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6315063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6315676Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6316883Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6318080Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6319248Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6320413Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6321639Z 
triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6322807Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6323972Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6325163Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6326359Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6327558Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6327807Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.6327942Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6328013Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6328077Z unimplemented [] 2025-12-04T10:01:25.6328183Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6328369Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6329560Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6329659Z graph_break [] 2025-12-04T10:01:25.6329786Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6329861Z Autotune Choices Stats: 2025-12-04T10:01:25.6331265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.6331515Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6331735Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6332055Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6333226Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6334388Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6335544Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6336667Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6337794Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6338966Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6339211Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.6339285Z Autotune Choices Stats: 2025-12-04T10:01:25.6340721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.6341194Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6341527Z strides: [16384, 8192, 64, 1], [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6342093Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6343383Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6344574Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6345735Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6346940Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6348178Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6349353Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6350519Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6352042Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6353271Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6354442Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6354692Z SingleProcess AUTOTUNE 
benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.6354844Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6354916Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6355019Z unimplemented [] 2025-12-04T10:01:25.6355126Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6355598Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6356815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6356884Z graph_break [] 2025-12-04T10:01:25.6357017Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6357095Z Autotune Choices Stats: 2025-12-04T10:01:25.6358509Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.6358760Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6359072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6359397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6360582Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6361761Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6362894Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6364037Z 
triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6365215Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6366353Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6366597Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.6366672Z Autotune Choices Stats: 2025-12-04T10:01:25.6368123Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6368601Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6368966Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6369533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6370741Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6371920Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6373079Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6374293Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6375455Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6376627Z triton_flex_attention_backward_772 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6377854Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6379047Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6380213Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6381376Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6381709Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.6381840Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6381915Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6381979Z unimplemented [] 2025-12-04T10:01:25.6382087Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6382272Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6383454Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6383523Z graph_break [] 2025-12-04T10:01:25.6383649Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6383721Z Autotune Choices Stats: 2025-12-04T10:01:25.6385120Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6385407Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6385626Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6385986Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6387160Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6388384Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6389509Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6390675Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6391797Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6392927Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6393168Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.6393276Z Autotune Choices Stats: 2025-12-04T10:01:25.6394718Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6395193Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6395521Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6396119Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6397298Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6398470Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6399661Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6400830Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6402004Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6403197Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6404400Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6405592Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6406763Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6407939Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6408216Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.6408345Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6408424Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6408487Z unimplemented [] 2025-12-04T10:01:25.6408594Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6408779Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6409960Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6410038Z graph_break [] 2025-12-04T10:01:25.6410167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6410283Z Autotune Choices Stats: 2025-12-04T10:01:25.6411685Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.6411965Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6412185Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6412500Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6413669Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6414800Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6415926Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6417098Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6418229Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6419366Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6419659Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.6419733Z Autotune Choices Stats: 2025-12-04T10:01:25.6421210Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6421688Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6422023Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6422588Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6423758Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6424958Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6426127Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6427367Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6428574Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6429771Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6430970Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6432132Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6433303Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6434497Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6434748Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.6434880Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6434957Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6435021Z unimplemented [] 2025-12-04T10:01:25.6435129Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6435317Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6436518Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6436621Z graph_break [] 2025-12-04T10:01:25.6436748Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6436823Z Autotune Choices Stats: 2025-12-04T10:01:25.6438253Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 
2025-12-04T10:01:25.6438506Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6438757Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6439076Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6440220Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6441347Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6442504Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6443632Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6444763Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6445930Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6446214Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.6446290Z Autotune Choices Stats: 2025-12-04T10:01:25.6447760Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6448203Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6448533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6449098Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6450269Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6451482Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6452636Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6453802Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6455029Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.6456552Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6457737Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6458899Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6460114Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6461287Z triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6461536Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.6461672Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6461750Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6461812Z unimplemented [] 2025-12-04T10:01:25.6461912Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6462162Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6463339Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6463409Z graph_break [] 2025-12-04T10:01:25.6463584Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.6463660Z Autotune Choices Stats: 2025-12-04T10:01:25.6465095Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.6465354Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6465579Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6465892Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6467030Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6468261Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6469384Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6470515Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6471674Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6472828Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6473075Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:25.6473150Z Autotune Choices Stats: 2025-12-04T10:01:25.6474638Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6475085Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6475416Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6476025Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.6477201Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6478370Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6479531Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6480785Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6482001Z triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6483161Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6484322Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6485514Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6486683Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6487848Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6488098Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:25.6488263Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6488340Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6488405Z unimplemented [] 2025-12-04T10:01:25.6488511Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6488707Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6489927Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6489999Z graph_break [] 2025-12-04T10:01:25.6490127Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6490200Z Autotune Choices Stats: 2025-12-04T10:01:25.6491642Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6491901Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6492119Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6492437Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6493616Z triton_flex_attention_859 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6494740Z triton_flex_attention_860 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6495875Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6497006Z triton_flex_attention_858 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.6498195Z triton_flex_attention_855 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6499354Z triton_flex_attention_856 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6499601Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.6499671Z Autotune Choices Stats: 2025-12-04T10:01:25.6501111Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6501562Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6501926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6502495Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6503670Z triton_flex_attention_backward_862 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6504836Z triton_flex_attention_backward_863 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6506035Z triton_flex_attention_backward_864 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6507292Z triton_flex_attention_backward_861 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6508499Z triton_flex_attention_backward_865 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6509673Z triton_flex_attention_backward_866 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.6510833Z triton_flex_attention_backward_868 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6512022Z triton_flex_attention_backward_867 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6513183Z triton_flex_attention_backward_870 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6514345Z triton_flex_attention_backward_869 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6514630Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3594 seconds precompiling for 13 choices 2025-12-04T10:01:25.6514759Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6514834Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6514897Z unimplemented [] 2025-12-04T10:01:25.6515048Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6515254Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6516514Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6516585Z graph_break [] 2025-12-04T10:01:25.6516713Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6516781Z Autotune Choices Stats: 2025-12-04T10:01:25.6518187Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6518433Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6518698Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6519012Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6520150Z triton_flex_attention_878 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6521270Z triton_flex_attention_879 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6522409Z triton_flex_attention_874 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6523603Z triton_flex_attention_876 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6524759Z triton_flex_attention_877 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6525894Z triton_flex_attention_875 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6526140Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3095 seconds precompiling for 6 choices 2025-12-04T10:01:25.6526217Z Autotune Choices Stats: 
2025-12-04T10:01:25.6527661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_880", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.6528147Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6528479Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6529049Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6530224Z triton_flex_attention_backward_880 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.6531396Z triton_flex_attention_backward_881 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6532623Z triton_flex_attention_backward_882 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6533826Z triton_flex_attention_backward_883 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6534983Z triton_flex_attention_backward_885 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6536151Z triton_flex_attention_backward_886 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6537349Z triton_flex_attention_backward_884 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6538510Z triton_flex_attention_backward_887 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6539679Z triton_flex_attention_backward_889 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6540867Z triton_flex_attention_backward_888 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6541150Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.3839 seconds precompiling for 13 choices 2025-12-04T10:01:25.6541329Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.6541418Z Traceback (most recent call last): 2025-12-04T10:01:25.6541724Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.6541791Z self.assertTrue( 2025-12-04T10:01:25.6542033Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.6542118Z raise self.failureException(msg) 2025-12-04T10:01:25.6542371Z AssertionError: False is not true : Log file /tmp/tmpehryl9m1/flex_attention_configs.json was not created 2025-12-04T10:01:25.6542378Z 2025-12-04T10:01:25.6542515Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:25.6542772Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 
2025-12-04T10:01:25.6542777Z 2025-12-04T10:01:25.6542946Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.6543075Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6543158Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6543225Z unimplemented [] 2025-12-04T10:01:25.6543331Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6544540Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.6544773Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6544842Z graph_break [] 2025-12-04T10:01:25.6544971Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6545970Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.6546063Z current_size = base.storage().size() 2025-12-04T10:01:25.6546132Z Autotune Choices Stats: 2025-12-04T10:01:25.6547590Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.6547897Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6548124Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6548444Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6549617Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6550765Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6551900Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6553014Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6554176Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6555565Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6555853Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.6555922Z Autotune Choices Stats: 2025-12-04T10:01:25.6557377Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.6557897Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6558280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6558899Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6560083Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6561247Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6562460Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6563620Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6564779Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6565934Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6567158Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6568350Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6569511Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6570672Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6570957Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.6571093Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6571165Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6571235Z unimplemented [] 2025-12-04T10:01:25.6571340Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6571533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6572728Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6572791Z graph_break [] 2025-12-04T10:01:25.6572929Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6572997Z Autotune Choices Stats: 2025-12-04T10:01:25.6574403Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6574687Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6574961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6575282Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6576452Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6577582Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6578711Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6579888Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6581019Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6582145Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6582434Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.6582502Z Autotune Choices Stats: 2025-12-04T10:01:25.6583972Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6584412Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6584778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6585341Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6586523Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6587770Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6588969Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6590133Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6591289Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6592508Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6593694Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6594852Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6596001Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6597196Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6597448Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.6597579Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6597651Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6597722Z unimplemented [] 2025-12-04T10:01:25.6597825Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6598014Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6599199Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6599310Z graph_break [] 2025-12-04T10:01:25.6599447Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6599515Z Autotune Choices Stats: 2025-12-04T10:01:25.6600959Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6601211Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6601441Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6601794Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6602934Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6604065Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6605231Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6606342Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6607468Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6608594Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6608877Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:25.6608960Z Autotune Choices Stats: 2025-12-04T10:01:25.6610436Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6610922Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6611261Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6611823Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6613002Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6614208Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6615381Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6616543Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6617742Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6618944Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6620131Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6621295Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6622452Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6623650Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6623904Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.6624033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6624104Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6624172Z unimplemented [] 2025-12-04T10:01:25.6624272Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6624462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6625652Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6625750Z graph_break [] 2025-12-04T10:01:25.6625884Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6625951Z Autotune Choices Stats: 2025-12-04T10:01:25.6627446Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6627728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6627957Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6628271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6629403Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6630538Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6631704Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6632831Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6633960Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.6635142Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6635394Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.6635462Z Autotune Choices Stats: 2025-12-04T10:01:25.6636948Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6637390Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6637722Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6638283Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6639495Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6640655Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6641823Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6642979Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6644227Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6645436Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6646606Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:25.6647768Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6648956Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6650117Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6650369Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.6650499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6650570Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6650676Z unimplemented [] 
2025-12-04T10:01:25.6650779Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6650965Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6652158Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6652297Z graph_break [] 2025-12-04T10:01:25.6652442Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6652513Z Autotune Choices Stats: 2025-12-04T10:01:25.6653953Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6654198Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6654420Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6654740Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6656179Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6657389Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6658516Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6659633Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6660809Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6661978Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6662274Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.6662345Z Autotune Choices Stats: 2025-12-04T10:01:25.6663790Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6664228Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6664575Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6665178Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6666360Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6667590Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6668762Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6669997Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6671198Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6672373Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6673531Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6674723Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6675881Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6677054Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6677368Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.6677510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6677581Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6677656Z unimplemented [] 2025-12-04T10:01:25.6677758Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6677947Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6679180Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6679247Z graph_break [] 2025-12-04T10:01:25.6679412Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6679484Z Autotune Choices Stats: 2025-12-04T10:01:25.6680899Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6681146Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6681368Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6681689Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6682863Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6683988Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6685121Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6686249Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6687433Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6688585Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6688835Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.6688903Z Autotune Choices Stats: 2025-12-04T10:01:25.6690349Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6690786Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6691154Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6691715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6692892Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6694059Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6695262Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6696457Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6697672Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6698844Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6700006Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6701204Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6702358Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6703516Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6703796Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.6703929Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6703999Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6704103Z unimplemented [] 2025-12-04T10:01:25.6704208Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6704397Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6705614Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6705679Z graph_break [] 2025-12-04T10:01:25.6705816Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6705883Z Autotune Choices Stats: 2025-12-04T10:01:25.6707338Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6707586Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6707858Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6708193Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6709330Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.6710458Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6711579Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6712778Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6713938Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6715067Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6715316Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.6715386Z Autotune Choices Stats: 2025-12-04T10:01:25.6716837Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.6717310Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6717645Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:25.6718210Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6719389Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6720585Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6721780Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6722967Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6724139Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6725310Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6726506Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6727672Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6728828Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6730055Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6730301Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.6730447Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:25.6730520Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6730602Z unimplemented [] 2025-12-04T10:01:25.6730707Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6730927Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6732123Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6732196Z graph_break [] 2025-12-04T10:01:25.6732334Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6732404Z Autotune Choices Stats: 2025-12-04T10:01:25.6733797Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.6734088Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6734307Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6734631Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6735768Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6736905Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6738063Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6739231Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6740401Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6741533Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6741785Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.6741895Z Autotune Choices Stats: 2025-12-04T10:01:25.6743348Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6743785Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6744122Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6744685Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6745867Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6747094Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6748349Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6749509Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6750674Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6751865Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6753018Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6754180Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6755665Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6756915Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6757216Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.6757359Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6757434Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6757498Z unimplemented [] 2025-12-04T10:01:25.6757607Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6757793Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6758978Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6759051Z graph_break [] 2025-12-04T10:01:25.6759186Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6759304Z Autotune Choices Stats: 2025-12-04T10:01:25.6760715Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.6760977Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6761199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6761520Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6762656Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6763852Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6765006Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6766164Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6767291Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6768415Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6768698Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.6768766Z Autotune Choices Stats: 2025-12-04T10:01:25.6770214Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6770650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6770986Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6771547Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6772821Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6774018Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6775183Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6776341Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6777532Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6778692Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6779863Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6781024Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6782247Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6783446Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6783694Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.6783831Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6783901Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6783964Z unimplemented [] 2025-12-04T10:01:25.6784081Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6784270Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6785455Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6785553Z graph_break [] 2025-12-04T10:01:25.6785694Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6785762Z Autotune Choices Stats: 2025-12-04T10:01:25.6787166Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6787502Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6787724Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6788052Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6789192Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6790440Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6791590Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6792735Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6793863Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6795015Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6795271Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.6795338Z Autotune Choices Stats: 2025-12-04T10:01:25.6796782Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6797215Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6797586Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6798148Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6799632Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6800839Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6802004Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6803168Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6804377Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6805538Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6806701Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6807945Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6809132Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6810299Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6810543Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:25.6810680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6810749Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6810813Z unimplemented [] 2025-12-04T10:01:25.6810924Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6811158Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6812349Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6812414Z graph_break [] 2025-12-04T10:01:25.6812541Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6812618Z Autotune Choices Stats: 2025-12-04T10:01:25.6814022Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.6814271Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6814489Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6814844Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6816003Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6817152Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6818283Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6819407Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6820573Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6821701Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6821952Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.6822019Z Autotune Choices Stats: 2025-12-04T10:01:25.6823451Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6823929Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6824259Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6824856Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6826054Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6827285Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6828457Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6829655Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6830819Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6831983Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6833184Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6834410Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6835571Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:25.6836743Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6836988Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.6837176Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6837245Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6837309Z unimplemented [] 2025-12-04T10:01:25.6837420Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6837605Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6838795Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6838860Z graph_break [] 2025-12-04T10:01:25.6838986Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6839059Z Autotune Choices Stats: 2025-12-04T10:01:25.6840462Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6840754Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6840972Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6841291Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6842455Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6843619Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6844746Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6845996Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6847150Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6848277Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6848528Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.6848607Z Autotune Choices Stats: 2025-12-04T10:01:25.6850052Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6850564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6850895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6851493Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6852662Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6853841Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6855078Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6856533Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.6857710Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6858948Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6860163Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6861371Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6862533Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6863702Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6863998Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.6864138Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6864209Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6864273Z unimplemented [] 2025-12-04T10:01:25.6864381Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6864568Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6865762Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6865825Z graph_break [] 2025-12-04T10:01:25.6865955Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6866033Z Autotune Choices Stats: 2025-12-04T10:01:25.6867514Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.6867811Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6868067Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6868393Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6869589Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6870732Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6871858Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6873032Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6874152Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6875278Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6875567Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.6875635Z Autotune Choices Stats: 2025-12-04T10:01:25.6877128Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.6877574Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:25.6877935Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6878505Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6879680Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6880852Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6882054Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6883212Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6884373Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6885600Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6886806Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6887984Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6889143Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6890334Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.6890582Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.6890730Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6890804Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6890870Z unimplemented [] 2025-12-04T10:01:25.6890983Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6891169Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6892353Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6892454Z graph_break [] 2025-12-04T10:01:25.6892586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6892660Z Autotune Choices Stats: 2025-12-04T10:01:25.6894091Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6894346Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6894597Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6894920Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6896062Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6897190Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6898345Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6899486Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6900611Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6901772Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6902021Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.6902089Z Autotune Choices Stats: 2025-12-04T10:01:25.6903616Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6904059Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6904393Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6904960Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6906136Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6907402Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6908574Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6909751Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6910993Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.6912183Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6913355Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6914524Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6915720Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6916892Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6917144Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.6917281Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6917352Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6917415Z unimplemented [] 2025-12-04T10:01:25.6917524Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6917710Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6918933Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6918996Z graph_break [] 2025-12-04T10:01:25.6919124Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.6919201Z Autotune Choices Stats: 2025-12-04T10:01:25.6920663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6920927Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6921148Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6921467Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6922599Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6923803Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6924926Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6926062Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6927182Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6928376Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6928617Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.6928694Z Autotune Choices Stats: 2025-12-04T10:01:25.6930172Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.6930618Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6930946Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6931516Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.6932727Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6933905Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6935075Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6936292Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6937504Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6938692Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6939872Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6941049Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6942242Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6943408Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6943655Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.6943790Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6943897Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6943963Z unimplemented [] 2025-12-04T10:01:25.6944071Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6944255Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6945474Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6945538Z graph_break [] 2025-12-04T10:01:25.6945666Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6945741Z Autotune Choices Stats: 2025-12-04T10:01:25.6947178Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6947473Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6947689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6948007Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6949140Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6950303Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6951423Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6952555Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.6953754Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6954917Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6955160Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.6955449Z Autotune Choices Stats: 2025-12-04T10:01:25.6957148Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6957600Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6958008Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6958576Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6959752Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6960922Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6962086Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6963362Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6964586Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6965749Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.6966914Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6968129Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6969292Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6970456Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.6970738Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.6970887Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.6970958Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.6971023Z unimplemented [] 2025-12-04T10:01:25.6971133Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.6971352Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.6972563Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.6972634Z graph_break [] 2025-12-04T10:01:25.6972762Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.6972840Z Autotune Choices Stats: 2025-12-04T10:01:25.6974237Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.6974488Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6974707Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6975062Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6976188Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6977340Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6978461Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.6979625Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.6980775Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6981939Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6987834Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.6987959Z Autotune Choices Stats: 
2025-12-04T10:01:25.6989441Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.6989988Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.6990335Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.6990912Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.6992117Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.6993311Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6994552Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.6995759Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.6996939Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.6998094Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.6999293Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7000451Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7001625Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7002824Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7003078Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.7003252Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7003344Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7003412Z unimplemented [] 2025-12-04T10:01:25.7003530Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7003722Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7004971Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7005043Z graph_break [] 2025-12-04T10:01:25.7005180Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7005253Z Autotune Choices Stats: 2025-12-04T10:01:25.7006687Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7006973Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7007199Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7007517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7008660Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7009796Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7010951Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7012126Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7013279Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.7014416Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7014665Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.7014737Z Autotune Choices Stats: 2025-12-04T10:01:25.7016175Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7016650Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7016982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7017552Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7018722Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7019930Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7021152Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7022331Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7023495Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7024652Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7025849Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.7027015Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7028285Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7029522Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7029773Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.7029906Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7030014Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7030081Z 
unimplemented [] 2025-12-04T10:01:25.7030191Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7030380Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7031566Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7031647Z graph_break [] 2025-12-04T10:01:25.7031783Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7031856Z Autotune Choices Stats: 2025-12-04T10:01:25.7033265Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7033553Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7033774Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7034092Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7035238Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7036367Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7037600Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7038762Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7039897Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7041026Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7041306Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.7041381Z Autotune Choices Stats: 2025-12-04T10:01:25.7042840Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.7043285Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7043620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7044186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7045397Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7046603Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7047797Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7048971Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7050137Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7051333Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7052503Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7053662Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7054902Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7056537Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7056805Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.7056941Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7057019Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7057085Z unimplemented [] 2025-12-04T10:01:25.7057201Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7057391Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7058579Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7058718Z graph_break [] 2025-12-04T10:01:25.7058848Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7058923Z Autotune Choices Stats: 2025-12-04T10:01:25.7060333Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7060587Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7060810Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7061127Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7062263Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7063486Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7064688Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7065813Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7066946Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7068144Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7068431Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.7068504Z Autotune Choices Stats: 2025-12-04T10:01:25.7069952Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7070397Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7070729Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7071359Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7072577Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7073776Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7074936Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7076105Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7077327Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7078490Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7079653Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7080846Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.7082041Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7083232Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7083481Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.7083613Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7083691Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7083755Z unimplemented [] 2025-12-04T10:01:25.7083856Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7084053Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7085256Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7085357Z graph_break [] 2025-12-04T10:01:25.7085485Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7085560Z Autotune Choices Stats: 2025-12-04T10:01:25.7086969Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7087221Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7087442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7087757Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7088932Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7090077Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7091236Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7092370Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7093502Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7094663Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7094911Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.7094987Z Autotune Choices Stats: 2025-12-04T10:01:25.7096433Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.7096914Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7097246Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7097848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7099051Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7100228Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7101393Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7102613Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7103774Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7104936Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7106140Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7107384Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7108585Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7109747Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7110002Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.7110136Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.7110248Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7110315Z unimplemented [] 2025-12-04T10:01:25.7110424Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7110612Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7111800Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7111872Z graph_break [] 2025-12-04T10:01:25.7112012Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7112087Z Autotune Choices Stats: 2025-12-04T10:01:25.7113497Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7113785Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7114010Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7114329Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7115504Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7116653Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7117781Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7118915Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7120072Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7121201Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7121446Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.7121521Z Autotune Choices Stats: 2025-12-04T10:01:25.7122968Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.7123454Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7123822Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7124397Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7125614Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7126789Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7127959Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7129164Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7130331Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7131499Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7132739Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7133947Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7135116Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7136278Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7136568Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.7136702Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7136781Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7136845Z unimplemented [] 2025-12-04T10:01:25.7136951Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7137156Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7138335Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7138403Z graph_break [] 2025-12-04T10:01:25.7138535Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7138604Z Autotune Choices Stats: 2025-12-04T10:01:25.7140009Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7140292Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7140517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7140864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7142039Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7143164Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7144293Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7145456Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7146586Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7147757Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7147999Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.7148109Z Autotune Choices Stats: 2025-12-04T10:01:25.7149575Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.7150016Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7150378Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7150954Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7152135Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7153313Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7154513Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7155973Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7157152Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7158397Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7159673Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7160835Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7161995Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7163205Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7163457Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.7165680Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7165791Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7165877Z unimplemented [] 2025-12-04T10:01:25.7165992Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7166191Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7167391Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7167455Z graph_break [] 2025-12-04T10:01:25.7167680Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7167776Z Autotune Choices Stats: 2025-12-04T10:01:25.7169225Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7169482Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7169706Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7170033Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7171169Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7172289Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7173405Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7174649Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7175774Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7176884Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7177176Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.7177247Z Autotune Choices Stats: 2025-12-04T10:01:25.7178717Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7179163Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7179493Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7180051Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7181216Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7182418Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7183613Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7184761Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7185956Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7187149Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7188389Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7189544Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7190694Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7191928Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7192179Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.7192320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7192391Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7192454Z unimplemented [] 2025-12-04T10:01:25.7192565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7192752Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7193938Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7194037Z graph_break [] 2025-12-04T10:01:25.7194167Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7194242Z Autotune Choices Stats: 2025-12-04T10:01:25.7195673Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7195932Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7196161Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7196482Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7197614Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7198730Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7199931Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7201063Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7202183Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7203371Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7203655Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.7203725Z Autotune Choices Stats: 2025-12-04T10:01:25.7205153Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7205595Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7205923Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7206511Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7207674Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7208904Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7210070Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7211222Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7212446Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7213595Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7214766Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7215936Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7217120Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7218315Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7218564Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.7218700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7218770Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7218833Z unimplemented [] 2025-12-04T10:01:25.7218977Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7219165Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7220349Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7220422Z graph_break [] 2025-12-04T10:01:25.7220586Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7220660Z Autotune Choices Stats: 2025-12-04T10:01:25.7222047Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7222302Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7222524Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7222858Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7223982Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7225135Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7226290Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7227464Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7228616Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7229764Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7230014Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.7230082Z Autotune Choices Stats: 2025-12-04T10:01:25.7231513Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.7231955Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7232282Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7232903Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7234112Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7235277Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7236444Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7237674Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7238823Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7239974Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7241127Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7242312Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7243500Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7244652Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7244927Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.7245077Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7245147Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7245217Z unimplemented [] 2025-12-04T10:01:25.7245324Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7245509Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7246736Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7246803Z graph_break [] 2025-12-04T10:01:25.7246939Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7247005Z Autotune Choices Stats: 2025-12-04T10:01:25.7248403Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7248649Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7248870Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7249189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7250352Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7251502Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7252629Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7253742Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7255187Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7256645Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7256907Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.7256977Z Autotune Choices Stats: 2025-12-04T10:01:25.7258427Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7258865Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7259276Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7259831Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7261084Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7262259Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7263479Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7264699Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7265856Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7267021Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7268245Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7269477Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7270629Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7271791Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7272082Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.7272227Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7272297Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7272365Z unimplemented [] 2025-12-04T10:01:25.7272502Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7272692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7273891Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7273956Z graph_break [] 2025-12-04T10:01:25.7274093Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7274161Z Autotune Choices Stats: 2025-12-04T10:01:25.7275581Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7275835Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7276101Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7276423Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7277602Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7278723Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7279841Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7281021Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7282143Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7283259Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7283513Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.7283594Z Autotune Choices Stats: 2025-12-04T10:01:25.7285040Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7285515Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7285885Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7286452Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7287631Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7288790Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7290012Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7291168Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7292323Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7293479Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7294692Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7295852Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7297015Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7298248Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7298499Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.7298637Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7298710Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7298780Z unimplemented [] 2025-12-04T10:01:25.7298883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7299070Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7300262Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7300326Z graph_break [] 2025-12-04T10:01:25.7300462Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7300532Z Autotune Choices Stats: 2025-12-04T10:01:25.7301931Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7302224Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7302445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7302805Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7303932Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7305056Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7306201Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7307401Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7308524Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7309639Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7309889Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.7309991Z Autotune Choices Stats: 2025-12-04T10:01:25.7311437Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.7311924Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7312280Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7312854Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7314027Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7315250Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7316410Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7317570Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7318745Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7319960Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7321145Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7322302Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7323479Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7324668Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7324916Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.7325050Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7325122Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7325191Z unimplemented [] 2025-12-04T10:01:25.7325291Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7325474Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7326662Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7326735Z graph_break [] 2025-12-04T10:01:25.7326870Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7326973Z Autotune Choices Stats: 2025-12-04T10:01:25.7328373Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7328675Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7328895Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7329227Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7330357Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7331508Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7332647Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7333770Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7334887Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7335996Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7336278Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.7336347Z Autotune Choices Stats: 2025-12-04T10:01:25.7337862Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:25.7338299Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7338633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7339189Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7340390Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7341583Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7342744Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7343902Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7345099Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7346295Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7347519Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7348679Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7349909Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7351071Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7351316Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.7351452Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7351522Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7351583Z unimplemented [] 2025-12-04T10:01:25.7351689Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7351872Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7353061Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7353159Z graph_break [] 2025-12-04T10:01:25.7353292Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7353362Z Autotune Choices Stats: 2025-12-04T10:01:25.7354790Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7355041Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7355475Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7355864Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7356998Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7358242Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7359366Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7360483Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7361607Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7362800Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7363048Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.7363170Z 
Autotune Choices Stats: 2025-12-04T10:01:25.7364613Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.7365049Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7365422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7365977Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7367188Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.7368345Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7369500Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7370659Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7371893Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7373057Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7374220Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7375446Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7376608Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7377773Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7378016Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.7378157Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7378227Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7378292Z unimplemented [] 2025-12-04T10:01:25.7378400Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7378621Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7379814Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7379876Z graph_break [] 2025-12-04T10:01:25.7380043Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7380115Z Autotune Choices Stats: 2025-12-04T10:01:25.7381507Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7381758Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7381978Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7382340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7383502Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7384621Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7385740Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7386866Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7388091Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.7389239Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7389490Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.7389555Z Autotune Choices Stats: 2025-12-04T10:01:25.7390988Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7391463Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7391793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7392402Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7393574Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7394739Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7395905Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7397102Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7398291Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7399448Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7400648Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.7401837Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7402995Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7404159Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7404403Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.7404569Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7404637Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7404698Z 
unimplemented [] 2025-12-04T10:01:25.7404803Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7404997Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7406216Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7406280Z graph_break [] 2025-12-04T10:01:25.7406406Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7406477Z Autotune Choices Stats: 2025-12-04T10:01:25.7407862Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7408143Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7408364Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7408683Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7409838Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7410954Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7412073Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7413196Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7414378Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7415497Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7415748Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:25.7415813Z Autotune Choices Stats: 2025-12-04T10:01:25.7417240Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7417746Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7418086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7418650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7419816Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7420973Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7422174Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7423361Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7424527Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7425730Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7426949Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7428200Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7429356Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7430520Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7430801Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.7430936Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7431005Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7431069Z unimplemented [] 2025-12-04T10:01:25.7431178Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7431394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7432585Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7432653Z graph_break [] 2025-12-04T10:01:25.7432780Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7432853Z Autotune Choices Stats: 2025-12-04T10:01:25.7434248Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7434530Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7434778Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7435096Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7436230Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7437351Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7438470Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7439622Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7440764Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7441891Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7442176Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.7442244Z Autotune Choices Stats: 2025-12-04T10:01:25.7443721Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7444167Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7444497Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7445059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7446224Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7447407Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7448638Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7449797Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7450956Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7452169Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7453343Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7454501Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.7455938Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7457196Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7457519Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.7457660Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7457728Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7457792Z unimplemented [] 2025-12-04T10:01:25.7457900Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7458081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7459260Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7459373Z graph_break [] 2025-12-04T10:01:25.7459499Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7459570Z Autotune Choices Stats: 2025-12-04T10:01:25.7461016Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7461270Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7461490Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7461809Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7462960Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7464080Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7465228Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7466384Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7467580Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7468743Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7468996Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.7469065Z Autotune Choices Stats: 2025-12-04T10:01:25.7470538Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7470983Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7471315Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7471879Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7473056Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7474337Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7475504Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7476668Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7477902Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7479051Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7480215Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7481396Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7482578Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7483772Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7484020Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.7484156Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.7484226Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7484298Z unimplemented [] 2025-12-04T10:01:25.7484408Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7484590Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7485815Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7485878Z graph_break [] 2025-12-04T10:01:25.7486005Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7486085Z Autotune Choices Stats: 2025-12-04T10:01:25.7487512Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7487764Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7487983Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7488311Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7489435Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7490566Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7491764Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7492890Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7494004Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7495193Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7495440Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.7495517Z Autotune Choices Stats: 2025-12-04T10:01:25.7496962Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7497409Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7497740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7498300Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7499511Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7500716Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7501886Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7503081Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7504278Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7505438Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7506605Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7507832Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7509070Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7510250Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7510500Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.7510635Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7510749Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7510815Z unimplemented [] 2025-12-04T10:01:25.7510924Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7511113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7512339Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7512402Z graph_break [] 2025-12-04T10:01:25.7512529Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7512601Z Autotune Choices Stats: 2025-12-04T10:01:25.7513998Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.7514250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7514467Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7514791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7515924Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7517131Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7518249Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7519378Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7520527Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7521702Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7521950Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.7522023Z Autotune Choices Stats: 2025-12-04T10:01:25.7523452Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7523893Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7524222Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7524830Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7526035Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7527220Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7528389Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7529612Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7530776Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7531928Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7533085Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7534306Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7535465Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7536625Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7536899Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.7537033Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7537112Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7537175Z unimplemented [] 2025-12-04T10:01:25.7537282Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7537462Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7538681Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7538745Z graph_break [] 2025-12-04T10:01:25.7538873Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7538948Z Autotune Choices Stats: 2025-12-04T10:01:25.7540343Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7540594Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7540814Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7541167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7542300Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7543456Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7544570Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7545738Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7546884Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7548046Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7548293Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.7548366Z Autotune Choices Stats: 2025-12-04T10:01:25.7549797Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7550273Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7550602Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7551206Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7552381Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7553549Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7554792Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7556251Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7557451Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7558618Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7559856Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7561076Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7562249Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7563462Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7563711Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.7563895Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7563964Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7564028Z unimplemented [] 2025-12-04T10:01:25.7564138Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7564326Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7565509Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7565577Z graph_break [] 2025-12-04T10:01:25.7565714Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7565794Z Autotune Choices Stats: 2025-12-04T10:01:25.7567205Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7567495Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7567718Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7568039Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7569210Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7570329Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7571446Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7572638Z 
triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7573762Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7574889Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7575140Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.7575212Z Autotune Choices Stats: 2025-12-04T10:01:25.7576659Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7577132Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7577506Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7578074Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7579250Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7580453Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7581642Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7582818Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7583978Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7585137Z triton_flex_attention_backward_735 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7586389Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7587624Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7588793Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7590040Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7590288Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.7590432Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7590509Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7590571Z unimplemented [] 2025-12-04T10:01:25.7590685Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7590879Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7592058Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7592128Z graph_break [] 2025-12-04T10:01:25.7592260Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7592331Z Autotune Choices Stats: 2025-12-04T10:01:25.7593738Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7594023Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7594277Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7594591Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7595735Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7596866Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7598052Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7599175Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7600297Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7601428Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7601710Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.7601782Z Autotune Choices Stats: 2025-12-04T10:01:25.7603251Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.7603694Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7604026Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7604594Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7605816Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7607058Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7608223Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7609389Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7610555Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7611778Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7612944Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7614111Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7615350Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7616658Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7616909Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.7617040Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7617114Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7617176Z unimplemented [] 2025-12-04T10:01:25.7617290Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7617474Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7618658Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7618785Z graph_break [] 2025-12-04T10:01:25.7618912Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7618988Z Autotune Choices Stats: 2025-12-04T10:01:25.7620427Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.7620685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7620908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7621223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7622359Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7623526Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7624692Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7625817Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7626939Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7628126Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7628418Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.7628492Z Autotune Choices Stats: 2025-12-04T10:01:25.7629964Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7630409Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7630740Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7631346Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7632541Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7633710Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7634871Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7636042Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7637238Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7638432Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7639594Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7640784Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7641980Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7643145Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7643399Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.7643529Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7643603Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7643667Z unimplemented [] 2025-12-04T10:01:25.7643774Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7643960Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7645139Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7645243Z graph_break [] 2025-12-04T10:01:25.7645370Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7645439Z Autotune Choices Stats: 2025-12-04T10:01:25.7646883Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.7647134Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7647357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7647669Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7648844Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7650016Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7651140Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7652268Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7653386Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7654586Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7654833Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.7654906Z Autotune Choices Stats: 2025-12-04T10:01:25.7656745Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7657371Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7657762Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7658464Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7659641Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7660804Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7661959Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7663164Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7664375Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.7665546Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7666705Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7668030Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7669189Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7670365Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7670614Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.7670747Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7670823Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7670922Z unimplemented [] 2025-12-04T10:01:25.7671024Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7671214Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7672426Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7672496Z graph_break [] 2025-12-04T10:01:25.7672632Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.7672707Z Autotune Choices Stats: 2025-12-04T10:01:25.7674104Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7674388Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7674607Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7674927Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7676100Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7677230Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7678361Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7679483Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7680640Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7681801Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7682045Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.7682122Z Autotune Choices Stats: 2025-12-04T10:01:25.7683566Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7684061Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7684442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7685013Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.7686184Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7687357Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7688516Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7689752Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7690919Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7692082Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7693309Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7694464Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7695632Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7696789Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7697073Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.7697203Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7697282Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7697344Z unimplemented [] 2025-12-04T10:01:25.7697444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7697638Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7698849Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7698918Z graph_break [] 2025-12-04T10:01:25.7699046Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7699119Z Autotune Choices Stats: 2025-12-04T10:01:25.7700513Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7701079Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7701295Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7701642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7702789Z triton_flex_attention_821 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7703907Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7705030Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7706190Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.7707400Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7708542Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7708783Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.7708888Z Autotune Choices Stats: 2025-12-04T10:01:25.7710367Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7710806Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7711134Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7711700Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7712876Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7714046Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7715240Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7716453Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7717609Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7718796Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.7719990Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7721144Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7722310Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7723465Z triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7723749Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.7723877Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7724010Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7724074Z unimplemented [] 2025-12-04T10:01:25.7724176Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7724367Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7725555Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7725622Z graph_break [] 2025-12-04T10:01:25.7725749Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7725860Z Autotune Choices Stats: 2025-12-04T10:01:25.7727269Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.7727542Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7727768Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7728082Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7729216Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7730344Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7731474Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7732671Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7733798Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7734928Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7735205Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:25.7735278Z Autotune Choices Stats: 
2025-12-04T10:01:25.7736747Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7737188Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7737515Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7738081Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7739250Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7740453Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7741645Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7742806Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7744044Z triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7745237Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7746410Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7747627Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7748796Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7750053Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7750306Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:25.7750442Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7750519Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7750595Z unimplemented [] 2025-12-04T10:01:25.7750703Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7750901Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7752085Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7752191Z graph_break [] 2025-12-04T10:01:25.7752322Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7752396Z Autotune Choices Stats: 2025-12-04T10:01:25.7753837Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7754086Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7754322Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7754643Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7756069Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7757205Z triton_flex_attention_860 0.0122 ms 92.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7758459Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7759593Z triton_flex_attention_858 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7760714Z triton_flex_attention_855 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.7761895Z triton_flex_attention_856 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7762199Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.7762275Z Autotune Choices Stats: 2025-12-04T10:01:25.7763717Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7764167Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7764498Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7765068Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7766243Z triton_flex_attention_backward_862 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7767487Z triton_flex_attention_backward_863 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7768657Z triton_flex_attention_backward_864 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7769819Z triton_flex_attention_backward_861 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7771047Z triton_flex_attention_backward_865 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7772208Z triton_flex_attention_backward_866 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7773374Z triton_flex_attention_backward_868 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.7774547Z triton_flex_attention_backward_867 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7775759Z triton_flex_attention_backward_870 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7776951Z triton_flex_attention_backward_869 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7777206Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3594 seconds precompiling for 13 choices 2025-12-04T10:01:25.7777340Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7777417Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7777480Z 
unimplemented [] 2025-12-04T10:01:25.7777585Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7777824Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7779010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7779080Z graph_break [] 2025-12-04T10:01:25.7779265Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7779335Z Autotune Choices Stats: 2025-12-04T10:01:25.7780746Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7780991Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7781217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7781534Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7782678Z triton_flex_attention_878 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7783830Z triton_flex_attention_879 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7784986Z triton_flex_attention_874 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7786116Z triton_flex_attention_876 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7787343Z triton_flex_attention_877 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7788523Z triton_flex_attention_875 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7788770Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3095 seconds precompiling for 6 choices 2025-12-04T10:01:25.7788843Z Autotune Choices Stats: 2025-12-04T10:01:25.7790284Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_880", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.7790726Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7791055Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7791655Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7792868Z triton_flex_attention_backward_880 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7794046Z triton_flex_attention_backward_881 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.7795224Z triton_flex_attention_backward_882 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7796567Z triton_flex_attention_backward_883 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7797946Z triton_flex_attention_backward_885 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7799120Z triton_flex_attention_backward_886 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7800280Z triton_flex_attention_backward_884 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7801479Z triton_flex_attention_backward_887 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7802679Z triton_flex_attention_backward_889 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7803840Z triton_flex_attention_backward_888 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7804123Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.3839 seconds precompiling for 13 choices 2025-12-04T10:01:25.7804256Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7804332Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7804398Z unimplemented [] 2025-12-04T10:01:25.7804501Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7804696Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7805912Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7805988Z graph_break [] 2025-12-04T10:01:25.7806122Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7806191Z Autotune Choices Stats: 2025-12-04T10:01:25.7807604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7807847Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7808073Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7808389Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7809568Z triton_flex_attention_897 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7810720Z triton_flex_attention_898 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7811849Z triton_flex_attention_893 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7812977Z triton_flex_attention_895 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7814185Z triton_flex_attention_896 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7815313Z triton_flex_attention_894 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7815561Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3269 seconds precompiling for 6 choices 2025-12-04T10:01:25.7815633Z Autotune Choices Stats: 2025-12-04T10:01:25.7817073Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_902", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7817522Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7817888Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7818448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7819666Z triton_flex_attention_backward_902 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7820854Z triton_flex_attention_backward_900 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7822048Z triton_flex_attention_backward_901 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7823247Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7824403Z triton_flex_attention_backward_899 0.0145 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7825567Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7826736Z triton_flex_attention_backward_906 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7828053Z triton_flex_attention_backward_903 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.7829222Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7830400Z triton_flex_attention_backward_907 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7830689Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3006 seconds precompiling for 13 choices 2025-12-04T10:01:25.7830866Z ___________ TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.7830949Z Traceback (most recent call last): 2025-12-04T10:01:25.7831283Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.7831349Z self.assertTrue( 2025-12-04T10:01:25.7831553Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.7831638Z raise self.failureException(msg) 2025-12-04T10:01:25.7831882Z AssertionError: False is not true : Log file /tmp/tmpenzy7uo9/flex_attention_configs.json was not created 2025-12-04T10:01:25.7831892Z 2025-12-04T10:01:25.7832028Z To execute this test, run the following 
from the base repo dir: 2025-12-04T10:01:25.7832278Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:25.7832284Z 2025-12-04T10:01:25.7832450Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.7832581Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7832650Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7832721Z unimplemented [] 2025-12-04T10:01:25.7832824Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7834023Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.7834250Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7834311Z graph_break [] 2025-12-04T10:01:25.7834444Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7835436Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. 
To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.7835525Z current_size = base.storage().size() 2025-12-04T10:01:25.7835625Z Autotune Choices Stats: 2025-12-04T10:01:25.7837262Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.7837554Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7837818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7838234Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7839394Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7840516Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7841638Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7842754Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7843880Z triton_flex_attention_3 0.0154 ms 61.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7845077Z 
triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7845328Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.7845397Z Autotune Choices Stats: 2025-12-04T10:01:25.7846855Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.7847335Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7847677Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7848271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, 
torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7849447Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7850616Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7851782Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7852970Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7854166Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7855594Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7856855Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7858083Z 
triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7859248Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7860420Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7860674Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.7860866Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7860937Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7861007Z unimplemented [] 2025-12-04T10:01:25.7861111Z stats 
[('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7861302Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7862557Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7862624Z graph_break [] 2025-12-04T10:01:25.7862761Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7862833Z Autotune Choices Stats: 2025-12-04T10:01:25.7864246Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7864531Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7864750Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7865080Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, 
torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7866258Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7867469Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7868610Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7869732Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7870927Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7872055Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7872320Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.7872390Z Autotune Choices Stats: 2025-12-04T10:01:25.7873841Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7874315Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7874683Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7875254Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7876435Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7877594Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7878812Z triton_flex_attention_backward_27 0.0121 ms 93.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7880036Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7881214Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7882370Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7883590Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7884764Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7885920Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7887098Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7887378Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.7887514Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7887584Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7887651Z unimplemented [] 2025-12-04T10:01:25.7887752Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7887975Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7889154Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7889216Z graph_break [] 2025-12-04T10:01:25.7889352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7889417Z Autotune Choices Stats: 2025-12-04T10:01:25.7890819Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7891097Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7891350Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7891673Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7892803Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7893928Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7895055Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7896211Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7897369Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7898489Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7898772Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 
6 choices 2025-12-04T10:01:25.7898838Z Autotune Choices Stats: 2025-12-04T10:01:25.7900312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7900751Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7901088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7901659Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7902836Z triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7903998Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7905234Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7906394Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7907647Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7908911Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7910075Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7911240Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7912395Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7913589Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7913833Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.7914000Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7914069Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7914138Z unimplemented [] 2025-12-04T10:01:25.7914240Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7914426Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7915615Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), 
('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7915710Z graph_break [] 2025-12-04T10:01:25.7915843Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7915910Z Autotune Choices Stats: 2025-12-04T10:01:25.7917610Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7917908Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7918130Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7918449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7919581Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7920722Z triton_flex_attention_62 0.0082 ms 
100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7921889Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7923049Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7924182Z triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.7925302Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7925588Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.7925654Z Autotune Choices Stats: 2025-12-04T10:01:25.7927131Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7927579Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7927916Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7928477Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7929658Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7930874Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7932083Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7933251Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7934460Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7935663Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7936821Z triton_flex_attention_backward_69 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, 
num_warps=4 2025-12-04T10:01:25.7937985Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7939142Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7940365Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7940612Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.7940751Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7940824Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7940894Z unimplemented [] 
2025-12-04T10:01:25.7940997Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7941186Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7942377Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7942492Z graph_break [] 2025-12-04T10:01:25.7942621Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7942687Z Autotune Choices Stats: 2025-12-04T10:01:25.7944122Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7944374Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7944595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7944917Z dtypes: torch.float16, torch.float16, torch.float16, 
torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7946046Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7947186Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7948436Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7949574Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7950694Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7951882Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7952142Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.7952212Z Autotune Choices Stats: 2025-12-04T10:01:25.7953667Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7954104Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7954438Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7955004Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7956510Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7957746Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7958907Z 
triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7960074Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7961344Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7962511Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7963659Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7964821Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7966067Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.7967228Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7967481Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.7967619Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.7967698Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.7967798Z unimplemented [] 2025-12-04T10:01:25.7967911Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.7968099Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.7969284Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.7969389Z graph_break [] 2025-12-04T10:01:25.7969531Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.7969599Z Autotune Choices Stats: 2025-12-04T10:01:25.7970999Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.7971250Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7971469Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7971791Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7972918Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7974076Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7975246Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.7976383Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.7977540Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7978683Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7978935Z SingleProcess AUTOTUNE benchmarking takes 0.2922 
seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.7979002Z Autotune Choices Stats: 2025-12-04T10:01:25.7980446Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.7980882Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.7981217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.7981813Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.7983025Z triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7984193Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7985364Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7986599Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7987827Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.7988994Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.7990156Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.7991367Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.7992560Z triton_flex_attention_backward_109 0.0143 ms 
78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8000628Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8001002Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.8001158Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8001232Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8001302Z unimplemented [] 2025-12-04T10:01:25.8001421Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8001612Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8002849Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), 
('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8002918Z graph_break [] 2025-12-04T10:01:25.8003056Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8003136Z Autotune Choices Stats: 2025-12-04T10:01:25.8004570Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8004835Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8005061Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8005382Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8006571Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8007728Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8008839Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8009959Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8011151Z triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8012268Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8012525Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.8012593Z Autotune Choices Stats: 2025-12-04T10:01:25.8014019Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8014501Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8014829Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 
2025-12-04T10:01:25.8015381Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8016602Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8017759Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8018953Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8020133Z triton_flex_attention_backward_123 0.0123 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8021287Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8022428Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8023582Z triton_flex_attention_backward_124 0.0143 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8024796Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8025940Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8027097Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8027486Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.8027629Z ----------------------------- Captured stdout call ----------------------------- 
2025-12-04T10:01:25.8027701Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8027806Z unimplemented [] 2025-12-04T10:01:25.8027928Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8028117Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8029305Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8029368Z graph_break [] 2025-12-04T10:01:25.8029496Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8029567Z Autotune Choices Stats: 2025-12-04T10:01:25.8030959Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.8031259Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8031481Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], 
[1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8031797Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8032951Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8034072Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8035191Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8036384Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8037498Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8038618Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8038869Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.8038936Z Autotune Choices Stats: 2025-12-04T10:01:25.8040371Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8040848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8041206Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8041759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8042926Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8044121Z triton_flex_attention_backward_140 0.0113 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8045304Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8046460Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8047604Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8048748Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8049994Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8051149Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8052297Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8053509Z triton_flex_attention_backward_151 0.0143 ms 78.6% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8053754Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.8053893Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8053964Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8054027Z unimplemented [] 2025-12-04T10:01:25.8054136Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8054321Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8055842Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8055913Z graph_break [] 2025-12-04T10:01:25.8056047Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8056135Z Autotune Choices Stats: 2025-12-04T10:01:25.8057799Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.8058199Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8058442Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8058814Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8059939Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8061056Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8062209Z triton_flex_attention_152 
0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8063418Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8064536Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8065652Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8065914Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.8066038Z Autotune Choices Stats: 2025-12-04T10:01:25.8067805Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8068332Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8068665Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8069223Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8070384Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8071595Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8072746Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8073894Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8075042Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8076232Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8077410Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8078556Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8079724Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8080945Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8081194Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.8081328Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8081397Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8081460Z unimplemented [] 2025-12-04T10:01:25.8081565Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8081748Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8082935Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), 
('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8082998Z graph_break [] 2025-12-04T10:01:25.8083123Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8083229Z Autotune Choices Stats: 2025-12-04T10:01:25.8084622Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8084915Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8085133Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8085448Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8086571Z triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8087707Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8088850Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8089966Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8091081Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8092193Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8092471Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.8092541Z Autotune Choices Stats: 2025-12-04T10:01:25.8093993Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8094440Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8094769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 
1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8095354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8096547Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8097700Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8098850Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8100002Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8101190Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8102370Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8103518Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8104693Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8105869Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8107025Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8107317Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 
2025-12-04T10:01:25.8107453Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8107521Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8107586Z unimplemented [] 2025-12-04T10:01:25.8107700Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8107886Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8109064Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8109166Z graph_break [] 2025-12-04T10:01:25.8109295Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8109369Z Autotune Choices Stats: 2025-12-04T10:01:25.8110794Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.8111045Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8111263Z strides: [16384, 8192, 
64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8111574Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8112691Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8113880Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8114988Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8116107Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8117213Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8118356Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8118629Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.8118700Z Autotune Choices Stats: 2025-12-04T10:01:25.8120122Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8120564Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8120925Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8121477Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8122667Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8123819Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8124981Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8126131Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8127348Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8128491Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8129638Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8130849Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8132003Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=4 2025-12-04T10:01:25.8133156Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8133404Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.8133541Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8133609Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8133673Z unimplemented [] 2025-12-04T10:01:25.8133781Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8134002Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8135190Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8135269Z graph_break [] 2025-12-04T10:01:25.8135429Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8135507Z Autotune Choices Stats: 2025-12-04T10:01:25.8136907Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8137157Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8137407Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8137715Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8138871Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8139979Z triton_flex_attention_214 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8141092Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8142210Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8143347Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8144519Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8144765Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.8144838Z Autotune Choices Stats: 2025-12-04T10:01:25.8146294Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8146798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8147127Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8147778Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8148954Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8150113Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8151261Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8152479Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.8153631Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8154779Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8156571Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8158078Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8159313Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8160473Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8160791Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.8160932Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8161001Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8161066Z unimplemented [] 2025-12-04T10:01:25.8161175Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8161361Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8162597Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), 
('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8162669Z graph_break [] 2025-12-04T10:01:25.8162801Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8162875Z Autotune Choices Stats: 2025-12-04T10:01:25.8164287Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.8164589Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8164818Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8165144Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8166313Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8167444Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8168564Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8169690Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8170868Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8171995Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8172241Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.8172313Z Autotune Choices Stats: 2025-12-04T10:01:25.8173757Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8174271Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 
2025-12-04T10:01:25.8174607Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8175163Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8176346Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8177517Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8178741Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8179937Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8181108Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8182298Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8183489Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8184650Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8185825Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8186985Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.8187368Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.8187501Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8187576Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8187642Z unimplemented [] 2025-12-04T10:01:25.8187788Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8187977Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8189162Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8189228Z graph_break [] 2025-12-04T10:01:25.8189354Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8189428Z Autotune Choices Stats: 2025-12-04T10:01:25.8190861Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8191109Z 
AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8191357Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8191667Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8192809Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8193937Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8195056Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8196247Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8197357Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8198482Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8198799Z SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.8198876Z Autotune Choices Stats: 2025-12-04T10:01:25.8200356Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8200803Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8201131Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8201689Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8202855Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8204022Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8205237Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8206399Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8207557Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8208792Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8209947Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8211104Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8212260Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8213477Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8213758Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.8213889Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8213964Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8214026Z unimplemented [] 2025-12-04T10:01:25.8214134Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8214319Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8215505Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8215604Z graph_break [] 2025-12-04T10:01:25.8215731Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.8215801Z Autotune Choices Stats: 2025-12-04T10:01:25.8217259Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8217512Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8217732Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8218047Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8219188Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8220305Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8221452Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8222602Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8223717Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8224873Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8225115Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.8225185Z Autotune Choices Stats: 2025-12-04T10:01:25.8226661Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8227108Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8227487Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8228056Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.8229231Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8230460Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8231622Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8232786Z triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8234004Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8235158Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8236318Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8237465Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8238661Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8239852Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8240098Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.8240226Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8240303Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8240365Z unimplemented [] 2025-12-04T10:01:25.8240471Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8240654Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8241887Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8241959Z graph_break [] 2025-12-04T10:01:25.8242085Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8242157Z Autotune Choices Stats: 2025-12-04T10:01:25.8243599Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8243856Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8244072Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8244383Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8245517Z triton_flex_attention_289 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8246670Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8247823Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8248948Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.8250065Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8251248Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8251491Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.8251567Z Autotune Choices Stats: 2025-12-04T10:01:25.8253001Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8253448Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8253779Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8254354Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8255838Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8257078Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8258245Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8259448Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8260667Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8261823Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.8262988Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8264145Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8265378Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8266550Z triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8266796Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.8266969Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8267050Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8267112Z unimplemented [] 2025-12-04T10:01:25.8267267Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8267484Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8268705Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8268775Z graph_break [] 2025-12-04T10:01:25.8268903Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8268977Z Autotune Choices Stats: 2025-12-04T10:01:25.8270368Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8270621Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8270842Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8271158Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8272288Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8273485Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8274611Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8275738Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8276928Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8278049Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8278303Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.8278378Z Autotune Choices Stats: 
2025-12-04T10:01:25.8279815Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8280254Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8280620Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8281184Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8282402Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.8283578Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8284746Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8285974Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8287141Z triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8288297Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8289466Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8290681Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8291848Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8293009Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8293292Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.8293424Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8293498Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8293562Z unimplemented [] 2025-12-04T10:01:25.8293670Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8293897Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8295080Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8295148Z graph_break [] 2025-12-04T10:01:25.8295275Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8295345Z Autotune Choices Stats: 2025-12-04T10:01:25.8296742Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8296988Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8297208Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8297561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8298722Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8299836Z triton_flex_attention_328 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8300960Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8302123Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8303278Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.8304402Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8304651Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.8304726Z Autotune Choices Stats: 2025-12-04T10:01:25.8306163Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8306656Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8306985Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8307671Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8308851Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8310023Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8311250Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8312413Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8313587Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8314744Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8315941Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.8317135Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8318294Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8319486Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8319743Z SingleProcess AUTOTUNE benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.8319905Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8319984Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8320046Z 
unimplemented [] 2025-12-04T10:01:25.8320148Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8320339Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8321525Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8321593Z graph_break [] 2025-12-04T10:01:25.8321734Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8321805Z Autotune Choices Stats: 2025-12-04T10:01:25.8323212Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8323502Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8323721Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8324034Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8325207Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8326325Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8327482Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8328628Z triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8329745Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8330872Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8331115Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.8331197Z Autotune Choices Stats: 2025-12-04T10:01:25.8332633Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8333177Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8333511Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8334079Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8335248Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8336446Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8337652Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8338821Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8339983Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8341174Z triton_flex_attention_backward_355 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8342363Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8343515Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8344676Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8345891Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8346142Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.8346269Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8346341Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8346402Z unimplemented [] 2025-12-04T10:01:25.8346504Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8346692Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8347937Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8348004Z graph_break [] 2025-12-04T10:01:25.8348131Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8348197Z Autotune Choices Stats: 2025-12-04T10:01:25.8349604Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8349886Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8350144Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8350458Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8351600Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8352725Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8353911Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8355036Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8356439Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8357574Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8357894Z 
SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.8357967Z Autotune Choices Stats: 2025-12-04T10:01:25.8359467Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8359918Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8360252Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8360819Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8362036Z triton_flex_attention_backward_369 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8363245Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8364401Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8365565Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8366721Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8367960Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8369121Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8370274Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.8371514Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8372670Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8372920Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.8373049Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8373123Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8373184Z unimplemented [] 2025-12-04T10:01:25.8373285Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8373477Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8374666Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8374772Z graph_break [] 2025-12-04T10:01:25.8374899Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8374966Z Autotune Choices Stats: 2025-12-04T10:01:25.8376420Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8376670Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8376896Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8377211Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8378353Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8379537Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8380658Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8381782Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8382902Z triton_flex_attention_383 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8384026Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8384303Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.8384380Z Autotune Choices Stats: 2025-12-04T10:01:25.8385854Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8386294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8386624Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8387273Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8388492Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8389658Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8390821Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8391981Z 
triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8393167Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8394358Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8395516Z triton_flex_attention_backward_392 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8396696Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8397888Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8399046Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8399294Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.8399422Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.8399496Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8399557Z unimplemented [] 2025-12-04T10:01:25.8399659Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8399849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8401032Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8401132Z graph_break [] 2025-12-04T10:01:25.8401257Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8401321Z Autotune Choices Stats: 2025-12-04T10:01:25.8402773Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8403018Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8403237Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8403549Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8404723Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8405879Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8407001Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8408123Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8409247Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8410442Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8410687Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.8410760Z Autotune Choices Stats: 2025-12-04T10:01:25.8412198Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.8412667Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8412993Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8413581Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8414757Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8415937Z triton_flex_attention_backward_405 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8417092Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8418285Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8419474Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8420630Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8421819Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8423003Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8424155Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8425311Z 
triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8425565Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.8425696Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8425809Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8425873Z unimplemented [] 2025-12-04T10:01:25.8425978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8426170Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8427473Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8427545Z graph_break [] 2025-12-04T10:01:25.8427675Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8427744Z Autotune Choices Stats: 2025-12-04T10:01:25.8429156Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8429436Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8429664Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8429981Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8431158Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8432279Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8433405Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8434528Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8435707Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8436866Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8437121Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.8437191Z Autotune Choices Stats: 2025-12-04T10:01:25.8438640Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8439114Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8439471Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8440043Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8441221Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8442389Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8443550Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8444778Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8445938Z 
triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8447102Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8448323Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8449484Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8450645Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8451799Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8452084Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.8452215Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8452293Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8452356Z unimplemented [] 2025-12-04T10:01:25.8452456Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8452643Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8453853Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8453926Z graph_break [] 2025-12-04T10:01:25.8454052Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8454121Z Autotune Choices Stats: 2025-12-04T10:01:25.8455784Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8456109Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8456333Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8456703Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8457839Z triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8458967Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8460090Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8461257Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8462435Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8463562Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8463805Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.8463909Z Autotune Choices Stats: 2025-12-04T10:01:25.8465398Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8465844Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8466174Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8466730Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8467968Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8469154Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8470419Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8471579Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8472739Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8473966Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8475125Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8476271Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8477431Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8478627Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8478875Z SingleProcess AUTOTUNE 
benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.8479037Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8479111Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8479173Z unimplemented [] 2025-12-04T10:01:25.8479278Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8479468Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8480657Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8480723Z graph_break [] 2025-12-04T10:01:25.8480887Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8480955Z Autotune Choices Stats: 2025-12-04T10:01:25.8482398Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8482647Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8482869Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8483186Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8484324Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8485437Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8486569Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8487752Z 
triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8488870Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8489999Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8490280Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.8490347Z Autotune Choices Stats: 2025-12-04T10:01:25.8491811Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8492246Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8492582Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8493143Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8494322Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8495527Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8496725Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8497885Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8499091Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8500292Z triton_flex_attention_backward_469 0.0133 
ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8501448Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8502620Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8503778Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8504994Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8505246Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.8505376Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8505452Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8505516Z unimplemented [] 2025-12-04T10:01:25.8505617Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8505811Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8506991Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8507093Z graph_break [] 2025-12-04T10:01:25.8507277Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8507344Z Autotune Choices Stats: 2025-12-04T10:01:25.8508788Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.8509034Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8509254Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8509570Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8510704Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8511823Z triton_flex_attention_480 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8513023Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8514139Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8515261Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8516412Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8516685Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.8516754Z Autotune Choices Stats: 2025-12-04T10:01:25.8518198Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.8518632Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8518961Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8519521Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8520694Z triton_flex_attention_backward_482 0.0112 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8521932Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8523092Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8524254Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8525475Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8526637Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8527795Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8528954Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8530193Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8531341Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8531590Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.8531715Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8531793Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8531861Z unimplemented [] 2025-12-04T10:01:25.8531996Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8532184Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8533357Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8533423Z graph_break [] 2025-12-04T10:01:25.8533582Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8533648Z Autotune Choices Stats: 2025-12-04T10:01:25.8535058Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8535301Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8535523Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8535836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8536970Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8538119Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8539279Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8540398Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8541553Z triton_flex_attention_497 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8542705Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8542963Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.8543032Z Autotune Choices Stats: 2025-12-04T10:01:25.8544471Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8544905Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8545240Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8545833Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8547045Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8548250Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8549416Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8550653Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8551812Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8552975Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8554128Z triton_flex_attention_backward_506 0.0154 ms 86.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8555595Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8556829Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8557986Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8558282Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.8558413Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8558481Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8558549Z unimplemented [] 2025-12-04T10:01:25.8558649Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8558849Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8560084Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8560149Z graph_break [] 2025-12-04T10:01:25.8560280Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8560347Z Autotune Choices Stats: 2025-12-04T10:01:25.8561748Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8562002Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8562229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8562541Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8563756Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8564902Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8566036Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8567162Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8568348Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8569473Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8569723Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.8569790Z Autotune Choices Stats: 2025-12-04T10:01:25.8571241Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.8571676Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8572047Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8572609Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8573820Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8574984Z triton_flex_attention_backward_521 0.0143 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8576177Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8577380Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8578531Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8579697Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8580850Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8582084Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8583253Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8584399Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8584679Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.8584809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8584877Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8584976Z unimplemented [] 2025-12-04T10:01:25.8585077Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8585262Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8586443Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8586504Z graph_break [] 2025-12-04T10:01:25.8586637Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8586704Z Autotune Choices Stats: 2025-12-04T10:01:25.8588220Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8588466Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8588730Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8589044Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8590208Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8591323Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8592444Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8593637Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8594760Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8595879Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8596125Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.8596191Z Autotune Choices Stats: 2025-12-04T10:01:25.8597635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.8598110Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8598474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8599032Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8600212Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8601660Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8602894Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8604058Z triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8605211Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8606368Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8607586Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8608747Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8609900Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8611113Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8611363Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.8611496Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8611568Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8611640Z unimplemented [] 2025-12-04T10:01:25.8611743Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8611931Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8613119Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8613182Z graph_break [] 2025-12-04T10:01:25.8613315Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8613385Z Autotune Choices Stats: 2025-12-04T10:01:25.8614798Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8615086Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8615312Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8615664Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8616802Z 
triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8617929Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8619089Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8620236Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8621359Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8622478Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8622741Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.8622846Z Autotune Choices Stats: 2025-12-04T10:01:25.8624289Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 
2025-12-04T10:01:25.8624757Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8625092Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8625654Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8626833Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8628152Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8629328Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, 
BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8630501Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8631667Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8632863Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8634048Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8635213Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8636396Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8637581Z triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8637836Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.8637966Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8638038Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8638109Z unimplemented [] 2025-12-04T10:01:25.8638210Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8638394Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8639587Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8639651Z graph_break [] 2025-12-04T10:01:25.8639785Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8639890Z Autotune Choices Stats: 2025-12-04T10:01:25.8641294Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8641569Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8641793Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8642106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8643233Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8644392Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8645546Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8646672Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8647793Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8648909Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8649189Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.8649256Z 
Autotune Choices Stats: 2025-12-04T10:01:25.8650728Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.8651168Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8651507Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8652066Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8653273Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.8654464Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8655890Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8657075Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8658304Z triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8659533Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8660695Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8661856Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8663100Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8664267Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8664519Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.8664652Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8664722Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8664791Z unimplemented [] 2025-12-04T10:01:25.8664892Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8665076Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8666266Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8666362Z graph_break [] 2025-12-04T10:01:25.8666502Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8666568Z Autotune Choices Stats: 2025-12-04T10:01:25.8668071Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8668322Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8668547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8668860Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8670007Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8671196Z triton_flex_attention_594 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8672325Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8673438Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8674573Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:25.8675731Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8676013Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.8676082Z Autotune Choices Stats: 2025-12-04T10:01:25.8677531Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.8677969Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8678338Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8678899Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8680105Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8681266Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8682431Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8683592Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8684826Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8685988Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8687136Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.8688359Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8689513Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8690677Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8690928Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:25.8691061Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8691132Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8691202Z 
unimplemented [] 2025-12-04T10:01:25.8691303Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8691540Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8692735Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8692798Z graph_break [] 2025-12-04T10:01:25.8692967Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8693035Z Autotune Choices Stats: 2025-12-04T10:01:25.8694439Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8694682Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8694908Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8695267Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8696431Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8697558Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8698682Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8699806Z triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8700973Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8702121Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8702372Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:25.8702439Z Autotune Choices Stats: 2025-12-04T10:01:25.8703879Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8704346Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8704679Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8705269Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8706449Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8707672Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8708838Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8710027Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8711224Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8712390Z triton_flex_attention_backward_620 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8713574Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8714768Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8715920Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8717091Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8717339Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:25.8717512Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8717581Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8717647Z unimplemented [] 2025-12-04T10:01:25.8717747Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8717933Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8719156Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8719220Z graph_break [] 2025-12-04T10:01:25.8719352Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8719416Z Autotune Choices Stats: 2025-12-04T10:01:25.8720817Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8721091Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8721313Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8721630Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8722814Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8723944Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8725072Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8726192Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8727387Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8728504Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8728753Z 
SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:25.8728818Z Autotune Choices Stats: 2025-12-04T10:01:25.8730262Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.8730764Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8731108Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8731672Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8732854Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8734020Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8735221Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8736409Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8737575Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8738814Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8740006Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8741166Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.8742322Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8743482Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8743761Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:25.8743894Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8743962Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8744030Z unimplemented [] 2025-12-04T10:01:25.8744167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8744353Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8745532Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8745597Z graph_break [] 2025-12-04T10:01:25.8745729Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8745805Z Autotune Choices Stats: 2025-12-04T10:01:25.8747597Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8747931Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8748236Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8748610Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8749758Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8750906Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8752044Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8753207Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8754371Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8755765Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8756133Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:25.8756213Z Autotune Choices Stats: 2025-12-04T10:01:25.8757989Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8758431Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8758769Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8759330Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8760522Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8761690Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8762942Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8764103Z 
triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8765267Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8766510Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8767677Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8768837Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8769997Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8771194Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8771471Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:25.8771607Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.8771675Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8771745Z unimplemented [] 2025-12-04T10:01:25.8771847Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8772033Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8773223Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8773319Z graph_break [] 2025-12-04T10:01:25.8773450Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8773515Z Autotune Choices Stats: 2025-12-04T10:01:25.8774955Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8775208Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8775426Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8775746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8776882Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8778022Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8779177Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8780337Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8781462Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8782610Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8782861Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:25.8782926Z Autotune Choices Stats: 2025-12-04T10:01:25.8784403Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8784841Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8785174Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8785733Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8786911Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8788242Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8789424Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8790584Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8791822Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8792996Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8794161Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8795326Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8796527Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8797725Z 
triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8797974Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:25.8798108Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8798176Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8798249Z unimplemented [] 2025-12-04T10:01:25.8798357Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8798541Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8799765Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8799828Z graph_break [] 2025-12-04T10:01:25.8799958Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8800025Z Autotune Choices Stats: 2025-12-04T10:01:25.8801453Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:25.8801703Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8801920Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8802241Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8803384Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8804516Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.8805708Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8806845Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8807978Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8809160Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8809412Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:25.8809481Z Autotune Choices Stats: 2025-12-04T10:01:25.8810939Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.8811387Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8811722Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8812281Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8813494Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8814694Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8815871Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8817073Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8818277Z 
triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8819582Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8820750Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8821918Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8823154Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8824313Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8824561Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:25.8824697Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8824803Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8824866Z unimplemented [] 2025-12-04T10:01:25.8824978Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8825163Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8826387Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8826454Z graph_break [] 2025-12-04T10:01:25.8826587Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8826654Z Autotune Choices Stats: 2025-12-04T10:01:25.8828130Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.8828380Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8828595Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8828917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8830054Z triton_flex_attention_707 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8831259Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8832386Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8833522Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8834710Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8835837Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8836084Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:25.8836151Z Autotune Choices Stats: 2025-12-04T10:01:25.8837596Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.8838031Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8838361Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8838957Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8840162Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8841338Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8842503Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8843725Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8844890Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8846052Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8847219Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8848449Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8849604Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8850768Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8851044Z SingleProcess AUTOTUNE 
benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:25.8851179Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8851246Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8851307Z unimplemented [] 2025-12-04T10:01:25.8851414Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8851647Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8852836Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8852899Z graph_break [] 2025-12-04T10:01:25.8853030Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8853099Z Autotune Choices Stats: 2025-12-04T10:01:25.8854502Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8854752Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8854968Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8855572Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8856717Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8857916Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8859045Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8860229Z 
triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8861405Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8862527Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8862779Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:25.8862845Z Autotune Choices Stats: 2025-12-04T10:01:25.8864291Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8864776Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8865110Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8865704Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8866886Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8868124Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8869370Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8870538Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8871703Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8872862Z triton_flex_attention_backward_735 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8874064Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8875290Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8876449Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8877642Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8877884Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:25.8878053Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8878124Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8878188Z unimplemented [] 2025-12-04T10:01:25.8878295Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8878485Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8879672Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8879737Z graph_break [] 2025-12-04T10:01:25.8879866Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8879939Z Autotune Choices Stats: 2025-12-04T10:01:25.8881344Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8881630Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8881849Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8882168Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8883338Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8884473Z triton_flex_attention_746 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8885653Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8886816Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8887958Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8889089Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8889340Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:25.8889407Z Autotune Choices Stats: 2025-12-04T10:01:25.8890852Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.8891329Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8891689Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8892256Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8893426Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8894624Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8895824Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8896990Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8898152Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8899314Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8900542Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8901714Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8902869Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8904094Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8904335Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:25.8904473Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8904542Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8904604Z unimplemented [] 2025-12-04T10:01:25.8904708Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8904893Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8906085Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8906147Z graph_break [] 2025-12-04T10:01:25.8906284Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8906357Z Autotune Choices Stats: 2025-12-04T10:01:25.8907806Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:25.8908101Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8908354Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8908674Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8909813Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8910941Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8912133Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8913258Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8914382Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8915515Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8915795Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:25.8915862Z Autotune Choices Stats: 2025-12-04T10:01:25.8917351Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8917790Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8918123Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8918692Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8919901Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8921103Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8922272Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8923446Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8924610Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8925831Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8926995Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8928161Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8929376Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8930536Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8930782Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:25.8930928Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8930997Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8931059Z unimplemented [] 2025-12-04T10:01:25.8931167Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8931352Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8932544Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8932642Z graph_break [] 2025-12-04T10:01:25.8932770Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8932842Z Autotune Choices Stats: 2025-12-04T10:01:25.8934285Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.8934536Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8934755Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8935073Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8936204Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8937364Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8938520Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8939652Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8940772Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8941908Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8942189Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:25.8942258Z Autotune Choices Stats: 2025-12-04T10:01:25.8943725Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8944166Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8944495Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8945097Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8946302Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8947546Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8948714Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8949876Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8951095Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.8952289Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8953458Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8954652Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8956186Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8957376Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.8957627Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:25.8957761Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8957829Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8957892Z unimplemented [] 2025-12-04T10:01:25.8957999Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8958183Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8959380Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8959495Z graph_break [] 2025-12-04T10:01:25.8959621Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.8959707Z Autotune Choices Stats: 2025-12-04T10:01:25.8961160Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8961413Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8961630Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8961950Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8963131Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8964292Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8965417Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.8966550Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8967676Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8968866Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8969110Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:25.8969179Z Autotune Choices Stats: 2025-12-04T10:01:25.8970622Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8971097Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8971422Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8972022Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:25.8973195Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8974361Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8975527Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8976720Z triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8977922Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8979095Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.8980259Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.8981499Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8982668Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.8983834Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8984081Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:25.8984216Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.8984318Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.8984384Z unimplemented [] 2025-12-04T10:01:25.8984489Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.8984674Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.8985895Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.8985971Z graph_break [] 2025-12-04T10:01:25.8986107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.8986183Z Autotune Choices Stats: 2025-12-04T10:01:25.8987663Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.8987952Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8988170Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8988488Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8989668Z triton_flex_attention_821 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8990797Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8991924Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8993050Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=2, num_warps=8 2025-12-04T10:01:25.8994206Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.8995381Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.8995628Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:25.8995700Z Autotune Choices Stats: 2025-12-04T10:01:25.8997142Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.8997614Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.8997974Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.8998544Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.8999730Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9000904Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9006069Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9007423Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9008606Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9009782Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=4 2025-12-04T10:01:25.9011028Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9012194Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9013361Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9014519Z triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9014855Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:25.9015005Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9015081Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9015147Z unimplemented [] 2025-12-04T10:01:25.9015264Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9015456Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9016680Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9016753Z graph_break [] 2025-12-04T10:01:25.9016886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9016965Z Autotune Choices Stats: 2025-12-04T10:01:25.9018374Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.9018685Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9018909Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9019266Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9020405Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9021528Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9022646Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9023805Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9024951Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9026073Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9026319Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:25.9026427Z Autotune Choices Stats: 
2025-12-04T10:01:25.9028000Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.9028451Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9028783Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9029340Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9030515Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.9031683Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9032909Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9034073Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9035237Z triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9036434Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9037627Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9038781Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9039947Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9041111Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9041392Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:25.9041525Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9041633Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9041699Z unimplemented [] 2025-12-04T10:01:25.9041813Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9042002Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9043190Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9043258Z graph_break [] 2025-12-04T10:01:25.9043399Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9043512Z Autotune Choices Stats: 2025-12-04T10:01:25.9044948Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9045201Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9045425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9045739Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9046887Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9048008Z triton_flex_attention_860 0.0122 ms 92.1% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9049129Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9050338Z triton_flex_attention_858 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9051465Z triton_flex_attention_855 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.9052587Z triton_flex_attention_856 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9052963Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.9053039Z Autotune Choices Stats: 2025-12-04T10:01:25.9054500Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.9054948Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9055569Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9056172Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9057358Z triton_flex_attention_backward_862 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9058616Z triton_flex_attention_backward_863 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9059830Z triton_flex_attention_backward_864 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9061005Z triton_flex_attention_backward_861 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9062218Z triton_flex_attention_backward_865 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9063425Z triton_flex_attention_backward_866 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9064595Z triton_flex_attention_backward_868 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:25.9065756Z triton_flex_attention_backward_867 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9066927Z triton_flex_attention_backward_870 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9068245Z triton_flex_attention_backward_869 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9068500Z SingleProcess AUTOTUNE benchmarking takes 0.6670 seconds and 2.3594 seconds precompiling for 13 choices 2025-12-04T10:01:25.9068636Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9068716Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9068781Z 
unimplemented [] 2025-12-04T10:01:25.9068891Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9069080Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9070275Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9070385Z graph_break [] 2025-12-04T10:01:25.9070519Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9070594Z Autotune Choices Stats: 2025-12-04T10:01:25.9072031Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9072290Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9072517Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9072834Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9073979Z triton_flex_attention_878 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9075099Z triton_flex_attention_879 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9076282Z triton_flex_attention_874 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9077412Z triton_flex_attention_876 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9078541Z triton_flex_attention_877 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9079699Z triton_flex_attention_875 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9079976Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3095 seconds precompiling for 6 choices 2025-12-04T10:01:25.9080054Z Autotune Choices Stats: 2025-12-04T10:01:25.9081502Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_880", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.9081954Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9082285Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9082847Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9084018Z triton_flex_attention_backward_880 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9085272Z triton_flex_attention_backward_881 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.9086437Z triton_flex_attention_backward_882 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9087598Z triton_flex_attention_backward_883 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9088823Z triton_flex_attention_backward_885 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9089978Z triton_flex_attention_backward_886 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9091151Z triton_flex_attention_backward_884 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9092308Z triton_flex_attention_backward_887 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9093504Z triton_flex_attention_backward_889 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9094696Z triton_flex_attention_backward_888 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9094945Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.3839 seconds precompiling for 13 choices 2025-12-04T10:01:25.9095074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9095149Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9095210Z unimplemented [] 2025-12-04T10:01:25.9095314Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9095533Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9096865Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9096948Z graph_break [] 2025-12-04T10:01:25.9097145Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9097230Z Autotune Choices Stats: 2025-12-04T10:01:25.9098792Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9099052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9099274Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9099583Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9100723Z triton_flex_attention_897 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9101870Z triton_flex_attention_898 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9103022Z triton_flex_attention_893 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9104145Z triton_flex_attention_895 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9105300Z triton_flex_attention_896 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9106461Z triton_flex_attention_894 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9106707Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3269 seconds precompiling for 6 choices 2025-12-04T10:01:25.9106778Z Autotune Choices Stats: 2025-12-04T10:01:25.9108275Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_902", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.9108717Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9109048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9109647Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9110850Z triton_flex_attention_backward_902 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9112017Z triton_flex_attention_backward_900 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9113183Z triton_flex_attention_backward_901 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9114429Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9115590Z triton_flex_attention_backward_899 0.0145 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9116916Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9118189Z triton_flex_attention_backward_906 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9119381Z triton_flex_attention_backward_903 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.9120578Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9121743Z triton_flex_attention_backward_907 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9122021Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3006 seconds precompiling for 13 choices 2025-12-04T10:01:25.9122152Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9122228Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9122292Z unimplemented [] 2025-12-04T10:01:25.9122395Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9122587Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9123801Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9123875Z graph_break [] 2025-12-04T10:01:25.9124004Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9124078Z Autotune Choices Stats: 2025-12-04T10:01:25.9125477Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_916", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.9125728Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9125952Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9126262Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9127446Z triton_flex_attention_916 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9128589Z triton_flex_attention_917 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9129728Z triton_flex_attention_914 0.0133 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9130850Z triton_flex_attention_912 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9132036Z triton_flex_attention_915 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9133161Z triton_flex_attention_913 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9133407Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3079 seconds precompiling for 6 choices 2025-12-04T10:01:25.9133482Z Autotune Choices Stats: 2025-12-04T10:01:25.9134920Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_919", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013407999649643898, "best_triton_pos": 0} 2025-12-04T10:01:25.9135363Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9135733Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9136295Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9137497Z triton_flex_attention_backward_919 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9138665Z triton_flex_attention_backward_918 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9139853Z triton_flex_attention_backward_920 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9141049Z 
triton_flex_attention_backward_921 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9142206Z triton_flex_attention_backward_923 0.0154 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9143361Z triton_flex_attention_backward_922 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9144522Z triton_flex_attention_backward_924 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9145743Z triton_flex_attention_backward_925 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9147050Z triton_flex_attention_backward_927 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9148492Z triton_flex_attention_backward_926 0.0174 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9148851Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.2834 seconds precompiling for 13 choices 2025-12-04T10:01:25.9149028Z ___________ 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda ___________ 2025-12-04T10:01:25.9149119Z Traceback (most recent call last): 2025-12-04T10:01:25.9149498Z File "/var/lib/jenkins/workspace/test/inductor/test_flex_attention.py", line 7343, in test_flex_attention_logging 2025-12-04T10:01:25.9149566Z self.assertTrue( 2025-12-04T10:01:25.9149780Z File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue 2025-12-04T10:01:25.9149865Z raise self.failureException(msg) 2025-12-04T10:01:25.9150115Z AssertionError: False is not true : Log file /tmp/tmpf8ob5xno/flex_attention_configs.json was not created 2025-12-04T10:01:25.9150121Z 2025-12-04T10:01:25.9150257Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:25.9150514Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:25.9150519Z 2025-12-04T10:01:25.9150686Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:25.9150817Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9150893Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9150959Z unimplemented [] 2025-12-04T10:01:25.9151067Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9152262Z inductor [('triton_bundler_save_kernel', 232), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('async_compile_cache_miss', 24), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2), ('async_compile_cache_hit', 1)] 2025-12-04T10:01:25.9152488Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:25.9152556Z graph_break [] 2025-12-04T10:01:25.9152685Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9153681Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py:3433: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:01:25.9153769Z current_size = base.storage().size() 2025-12-04T10:01:25.9153881Z Autotune Choices Stats: 2025-12-04T10:01:25.9155551Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_4", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009375999681651592, "best_triton_pos": 0} 2025-12-04T10:01:25.9155827Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9156056Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9156455Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9157652Z triton_flex_attention_4 0.0094 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9158763Z triton_flex_attention_5 0.0102 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9159876Z triton_flex_attention_2 0.0133 ms 70.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9160983Z triton_flex_attention_0 0.0143 ms 65.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9162104Z triton_flex_attention_3 0.0154 ms 61.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9163318Z triton_flex_attention_1 0.0164 ms 57.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9163571Z SingleProcess AUTOTUNE benchmarking takes 0.2592 seconds and 1.6406 seconds precompiling for 6 choices 2025-12-04T10:01:25.9163640Z Autotune Choices Stats: 2025-12-04T10:01:25.9165083Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_7", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.9165561Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9165905Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9166607Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9167936Z triton_flex_attention_backward_7 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9169097Z triton_flex_attention_backward_8 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9170254Z triton_flex_attention_backward_6 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9171445Z triton_flex_attention_backward_9 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9172628Z triton_flex_attention_backward_11 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9173790Z triton_flex_attention_backward_10 0.0155 ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9174977Z triton_flex_attention_backward_12 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9176186Z triton_flex_attention_backward_13 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9177347Z triton_flex_attention_backward_15 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9178498Z triton_flex_attention_backward_14 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9178752Z SingleProcess AUTOTUNE benchmarking takes 0.8612 seconds and 2.3302 seconds precompiling for 13 choices 2025-12-04T10:01:25.9178928Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9179001Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9179072Z unimplemented [] 2025-12-04T10:01:25.9179177Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9179375Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9180607Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9180672Z graph_break [] 2025-12-04T10:01:25.9180808Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9180877Z Autotune Choices Stats: 2025-12-04T10:01:25.9182285Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_23", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, 
"best_triton_pos": 0} 2025-12-04T10:01:25.9182565Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9182790Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9183107Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9184284Z triton_flex_attention_23 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9185404Z triton_flex_attention_24 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9186527Z triton_flex_attention_21 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9187712Z triton_flex_attention_22 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9188892Z triton_flex_attention_19 0.0122 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9190008Z triton_flex_attention_20 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9190263Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.2929 seconds precompiling for 6 choices 2025-12-04T10:01:25.9190335Z Autotune Choices Stats: 2025-12-04T10:01:25.9191780Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_26", "best_kernel_desc": 
"BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9192249Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9192621Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9193179Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9194351Z triton_flex_attention_backward_26 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9195523Z triton_flex_attention_backward_28 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9196711Z triton_flex_attention_backward_27 0.0121 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9197907Z triton_flex_attention_backward_25 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9199059Z triton_flex_attention_backward_30 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.9200224Z triton_flex_attention_backward_29 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9201443Z triton_flex_attention_backward_31 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9202612Z triton_flex_attention_backward_32 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9203773Z triton_flex_attention_backward_33 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9204928Z triton_flex_attention_backward_34 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9205212Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2118 seconds precompiling for 13 choices 2025-12-04T10:01:25.9205343Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9205414Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9205489Z unimplemented [] 2025-12-04T10:01:25.9205591Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9205813Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9207010Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9207073Z graph_break [] 2025-12-04T10:01:25.9207206Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.9207274Z Autotune Choices Stats: 2025-12-04T10:01:25.9208675Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_42", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9208960Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9209217Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9209536Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9210670Z triton_flex_attention_42 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9211785Z triton_flex_attention_43 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9212905Z triton_flex_attention_40 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9214064Z triton_flex_attention_41 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9215212Z triton_flex_attention_38 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9216334Z triton_flex_attention_39 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9216617Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2944 seconds precompiling for 6 choices 2025-12-04T10:01:25.9216684Z Autotune Choices Stats: 2025-12-04T10:01:25.9218160Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_45", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9218598Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9218934Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9219487Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9220663Z 
triton_flex_attention_backward_45 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9221823Z triton_flex_attention_backward_46 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9223047Z triton_flex_attention_backward_47 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9224203Z triton_flex_attention_backward_44 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9225356Z triton_flex_attention_backward_49 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9226577Z triton_flex_attention_backward_48 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9227784Z triton_flex_attention_backward_50 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9228949Z triton_flex_attention_backward_51 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9230106Z triton_flex_attention_backward_56 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9231289Z triton_flex_attention_backward_52 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9231538Z SingleProcess AUTOTUNE benchmarking takes 0.6621 seconds and 2.2749 seconds precompiling for 13 choices 2025-12-04T10:01:25.9231700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9231771Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9231844Z unimplemented [] 2025-12-04T10:01:25.9231944Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9232131Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 
2025-12-04T10:01:25.9233318Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9233413Z graph_break [] 2025-12-04T10:01:25.9233548Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9233616Z Autotune Choices Stats: 2025-12-04T10:01:25.9235051Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_61", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9235300Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9235527Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9235836Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9236965Z triton_flex_attention_61 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9238094Z triton_flex_attention_62 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9239250Z triton_flex_attention_59 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9240396Z triton_flex_attention_60 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9241514Z 
triton_flex_attention_57 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9242625Z triton_flex_attention_58 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9242909Z SingleProcess AUTOTUNE benchmarking takes 0.2908 seconds and 1.2678 seconds precompiling for 6 choices 2025-12-04T10:01:25.9242977Z Autotune Choices Stats: 2025-12-04T10:01:25.9244466Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_63", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9244902Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9245251Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9245808Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9246982Z triton_flex_attention_backward_63 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9248176Z triton_flex_attention_backward_64 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9249364Z triton_flex_attention_backward_65 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9250523Z triton_flex_attention_backward_66 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9251723Z triton_flex_attention_backward_67 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9252918Z triton_flex_attention_backward_68 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9254074Z triton_flex_attention_backward_69 0.0123 ms 
91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9255435Z triton_flex_attention_backward_70 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9256651Z triton_flex_attention_backward_72 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9257936Z triton_flex_attention_backward_75 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9258198Z SingleProcess AUTOTUNE benchmarking takes 0.6625 seconds and 2.2137 seconds precompiling for 13 choices 2025-12-04T10:01:25.9258331Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9258400Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9258470Z unimplemented [] 2025-12-04T10:01:25.9258576Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9258764Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9259960Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9260074Z graph_break [] 2025-12-04T10:01:25.9260209Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9260277Z Autotune Choices Stats: 2025-12-04T10:01:25.9261725Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_80", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9261971Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9262197Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9262506Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9263637Z triton_flex_attention_80 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9264757Z triton_flex_attention_81 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9265942Z triton_flex_attention_78 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9267064Z triton_flex_attention_79 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9268247Z triton_flex_attention_76 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9269428Z triton_flex_attention_77 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9269679Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3222 seconds precompiling for 6 choices 2025-12-04T10:01:25.9269746Z Autotune Choices Stats: 2025-12-04T10:01:25.9271192Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_82", 
"best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9271636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9271973Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9272526Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9273743Z triton_flex_attention_backward_82 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9274928Z triton_flex_attention_backward_83 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9276094Z triton_flex_attention_backward_84 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9277246Z triton_flex_attention_backward_85 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9278483Z triton_flex_attention_backward_87 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9279639Z triton_flex_attention_backward_88 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9280791Z triton_flex_attention_backward_86 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9281949Z triton_flex_attention_backward_89 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9283167Z triton_flex_attention_backward_94 0.0142 ms 79.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9284324Z triton_flex_attention_backward_90 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9284576Z SingleProcess AUTOTUNE benchmarking takes 0.6623 seconds and 2.2708 seconds precompiling for 13 choices 2025-12-04T10:01:25.9284709Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9284778Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9284916Z unimplemented [] 2025-12-04T10:01:25.9285021Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9285206Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9286432Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9286507Z graph_break [] 2025-12-04T10:01:25.9286642Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:25.9286711Z Autotune Choices Stats: 2025-12-04T10:01:25.9288117Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_99", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9288365Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9288585Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9288904Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9290031Z triton_flex_attention_99 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9291200Z triton_flex_attention_100 0.0092 ms 89.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9292367Z triton_flex_attention_97 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9295675Z triton_flex_attention_98 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9296805Z triton_flex_attention_95 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9299503Z triton_flex_attention_96 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9299763Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.2558 seconds precompiling for 6 choices 2025-12-04T10:01:25.9299840Z Autotune Choices Stats: 2025-12-04T10:01:25.9301276Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_101", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9301735Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9302077Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9302629Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9303851Z 
triton_flex_attention_backward_101 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9305004Z triton_flex_attention_backward_102 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9306242Z triton_flex_attention_backward_103 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9307510Z triton_flex_attention_backward_104 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9308730Z triton_flex_attention_backward_106 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9309884Z triton_flex_attention_backward_108 0.0132 ms 85.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9311035Z triton_flex_attention_backward_105 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9312184Z triton_flex_attention_backward_107 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9313397Z triton_flex_attention_backward_109 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9314553Z triton_flex_attention_backward_110 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9314865Z SingleProcess AUTOTUNE benchmarking takes 0.6614 seconds and 2.3222 seconds precompiling for 13 choices 2025-12-04T10:01:25.9315006Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9315079Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9315147Z unimplemented [] 2025-12-04T10:01:25.9315253Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9315441Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 
2)] 2025-12-04T10:01:25.9316676Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9316786Z graph_break [] 2025-12-04T10:01:25.9316924Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9316992Z Autotune Choices Stats: 2025-12-04T10:01:25.9318401Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_118", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9318654Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9318875Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9319191Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9320326Z triton_flex_attention_118 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9321488Z triton_flex_attention_119 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9322607Z triton_flex_attention_114 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9323761Z triton_flex_attention_116 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9324917Z 
triton_flex_attention_117 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9326075Z triton_flex_attention_115 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9326334Z SingleProcess AUTOTUNE benchmarking takes 0.2923 seconds and 1.2549 seconds precompiling for 6 choices 2025-12-04T10:01:25.9326412Z Autotune Choices Stats: 2025-12-04T10:01:25.9327861Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_120", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9328297Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9328633Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9329226Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9330400Z triton_flex_attention_backward_120 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9331608Z triton_flex_attention_backward_121 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9332762Z triton_flex_attention_backward_122 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9333982Z triton_flex_attention_backward_123 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9335132Z triton_flex_attention_backward_125 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9336298Z triton_flex_attention_backward_126 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9337446Z triton_flex_attention_backward_124 0.0143 
ms 86.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9338641Z triton_flex_attention_backward_127 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9339796Z triton_flex_attention_backward_132 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9340992Z triton_flex_attention_backward_128 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9341239Z SingleProcess AUTOTUNE benchmarking takes 0.6626 seconds and 2.2483 seconds precompiling for 13 choices 2025-12-04T10:01:25.9341379Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9341536Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9341608Z unimplemented [] 2025-12-04T10:01:25.9341711Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9341900Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9343094Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9343155Z graph_break [] 2025-12-04T10:01:25.9343288Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9343369Z Autotune Choices Stats: 2025-12-04T10:01:25.9344766Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_137", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.9345008Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9345227Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9345551Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9346708Z triton_flex_attention_137 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9347872Z triton_flex_attention_138 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9349047Z triton_flex_attention_135 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9350208Z triton_flex_attention_136 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9351368Z triton_flex_attention_133 0.0123 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9352478Z triton_flex_attention_134 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9352735Z SingleProcess AUTOTUNE benchmarking takes 0.2914 seconds and 1.3372 seconds precompiling for 6 choices 2025-12-04T10:01:25.9352803Z Autotune Choices Stats: 2025-12-04T10:01:25.9354234Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_141", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9354677Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9355048Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9355927Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9357129Z triton_flex_attention_backward_141 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9358377Z triton_flex_attention_backward_140 0.0113 ms 99.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9359583Z triton_flex_attention_backward_142 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9360799Z triton_flex_attention_backward_139 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9361958Z triton_flex_attention_backward_144 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9363120Z triton_flex_attention_backward_143 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9364329Z triton_flex_attention_backward_145 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9365492Z triton_flex_attention_backward_146 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9366692Z triton_flex_attention_backward_148 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9367903Z triton_flex_attention_backward_151 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9368191Z SingleProcess AUTOTUNE benchmarking takes 0.6611 seconds and 2.2435 seconds precompiling for 13 choices 2025-12-04T10:01:25.9368334Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9368408Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9368477Z unimplemented [] 2025-12-04T10:01:25.9368583Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9368772Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9369972Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9370036Z graph_break [] 2025-12-04T10:01:25.9370174Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9370243Z Autotune Choices Stats: 2025-12-04T10:01:25.9371650Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_156", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009184000082314014, "best_triton_pos": 0} 2025-12-04T10:01:25.9371905Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9372128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9372492Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9373635Z triton_flex_attention_156 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9374788Z triton_flex_attention_157 0.0092 ms 99.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9375897Z triton_flex_attention_152 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9377110Z triton_flex_attention_154 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9378228Z triton_flex_attention_155 0.0113 ms 81.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9379351Z triton_flex_attention_153 0.0143 ms 64.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9379603Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.9379669Z Autotune Choices Stats: 2025-12-04T10:01:25.9381150Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_159", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9381589Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9381926Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9382517Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9383693Z triton_flex_attention_backward_159 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9384883Z triton_flex_attention_backward_158 0.0122 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9386092Z triton_flex_attention_backward_160 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9387311Z triton_flex_attention_backward_161 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9388488Z triton_flex_attention_backward_163 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9389647Z triton_flex_attention_backward_162 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9390848Z triton_flex_attention_backward_164 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9392049Z triton_flex_attention_backward_165 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9393204Z triton_flex_attention_backward_167 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9394439Z triton_flex_attention_backward_170 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9394695Z SingleProcess AUTOTUNE benchmarking takes 0.6627 seconds and 2.2511 seconds precompiling for 13 choices 2025-12-04T10:01:25.9394836Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9394908Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9394975Z unimplemented [] 2025-12-04T10:01:25.9395078Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9395269Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9396465Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9396525Z graph_break [] 2025-12-04T10:01:25.9396658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9396724Z Autotune Choices Stats: 2025-12-04T10:01:25.9398123Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_175", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9398411Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9398644Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9398961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9400095Z 
triton_flex_attention_175 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9401254Z triton_flex_attention_176 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9402403Z triton_flex_attention_173 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9403563Z triton_flex_attention_174 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9404683Z triton_flex_attention_171 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9405798Z triton_flex_attention_172 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9406050Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.2850 seconds precompiling for 6 choices 2025-12-04T10:01:25.9406121Z Autotune Choices Stats: 2025-12-04T10:01:25.9407599Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_177", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:25.9408036Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9408425Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9408979Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9410185Z triton_flex_attention_backward_177 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9411378Z triton_flex_attention_backward_179 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9412540Z triton_flex_attention_backward_178 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, 
BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9413709Z triton_flex_attention_backward_180 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9414872Z triton_flex_attention_backward_181 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9416065Z triton_flex_attention_backward_182 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9417224Z triton_flex_attention_backward_183 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9418422Z triton_flex_attention_backward_184 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9419609Z triton_flex_attention_backward_186 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9420811Z triton_flex_attention_backward_189 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9421062Z SingleProcess AUTOTUNE benchmarking takes 0.6643 seconds and 2.2075 seconds precompiling for 13 choices 2025-12-04T10:01:25.9421204Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9421274Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9421343Z unimplemented [] 2025-12-04T10:01:25.9421444Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9421630Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9422818Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9422883Z graph_break [] 2025-12-04T10:01:25.9423016Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9423085Z Autotune Choices Stats: 2025-12-04T10:01:25.9424549Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_194", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008287999778985977, "best_triton_pos": 0} 2025-12-04T10:01:25.9424797Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9425069Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9425386Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9426511Z triton_flex_attention_194 0.0083 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9427737Z triton_flex_attention_195 0.0092 ms 89.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9428901Z triton_flex_attention_193 0.0111 ms 74.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9430025Z triton_flex_attention_192 0.0113 ms 73.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9431148Z triton_flex_attention_190 0.0123 ms 67.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9432265Z triton_flex_attention_191 0.0133 ms 62.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9432549Z SingleProcess AUTOTUNE benchmarking takes 0.2929 seconds and 1.2912 seconds precompiling for 6 choices 2025-12-04T10:01:25.9432616Z Autotune Choices Stats: 
2025-12-04T10:01:25.9434059Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_196", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9434528Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9434863Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9435416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9436670Z triton_flex_attention_backward_196 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:25.9437829Z triton_flex_attention_backward_197 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9439002Z triton_flex_attention_backward_198 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9440170Z triton_flex_attention_backward_199 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9441379Z triton_flex_attention_backward_201 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9442535Z triton_flex_attention_backward_200 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9443736Z triton_flex_attention_backward_202 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9444924Z triton_flex_attention_backward_203 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9446120Z triton_flex_attention_backward_208 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9447279Z triton_flex_attention_backward_204 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9447524Z SingleProcess AUTOTUNE benchmarking takes 0.6640 seconds and 2.1752 seconds precompiling for 13 choices 2025-12-04T10:01:25.9447659Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9447728Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9447789Z unimplemented [] 2025-12-04T10:01:25.9447895Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9448087Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9449277Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9449342Z graph_break [] 2025-12-04T10:01:25.9449511Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9449581Z Autotune Choices Stats: 2025-12-04T10:01:25.9450979Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_213", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9451260Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9451476Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9451793Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9452955Z triton_flex_attention_213 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9454110Z triton_flex_attention_214 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9455441Z triton_flex_attention_211 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9456670Z triton_flex_attention_212 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9457801Z triton_flex_attention_209 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.9459001Z triton_flex_attention_210 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9459257Z SingleProcess AUTOTUNE benchmarking takes 0.2921 seconds and 1.3684 seconds precompiling for 6 choices 2025-12-04T10:01:25.9459327Z Autotune Choices Stats: 2025-12-04T10:01:25.9460767Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_215", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9461266Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9461643Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9462245Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9463431Z triton_flex_attention_backward_215 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9464602Z triton_flex_attention_backward_216 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9465771Z triton_flex_attention_backward_217 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9466970Z triton_flex_attention_backward_218 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9468196Z triton_flex_attention_backward_220 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9469395Z triton_flex_attention_backward_219 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9470588Z triton_flex_attention_backward_221 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.9471798Z triton_flex_attention_backward_222 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9472959Z triton_flex_attention_backward_227 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9474128Z triton_flex_attention_backward_224 0.0134 ms 84.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9474375Z SingleProcess AUTOTUNE benchmarking takes 0.6655 seconds and 2.4502 seconds precompiling for 13 choices 2025-12-04T10:01:25.9474510Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9474579Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9474645Z 
unimplemented [] 2025-12-04T10:01:25.9474753Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9474939Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9476163Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9476226Z graph_break [] 2025-12-04T10:01:25.9476362Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9476427Z Autotune Choices Stats: 2025-12-04T10:01:25.9477860Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_232", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.009216000325977802, "best_triton_pos": 0} 2025-12-04T10:01:25.9478111Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9478331Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9478652Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9479848Z triton_flex_attention_232 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9480975Z triton_flex_attention_233 0.0092 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9482100Z triton_flex_attention_228 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9483221Z triton_flex_attention_230 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9484378Z triton_flex_attention_231 0.0113 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9485493Z triton_flex_attention_229 0.0143 ms 64.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9485776Z SingleProcess AUTOTUNE benchmarking takes 0.2912 seconds and 1.4103 seconds precompiling for 6 choices 2025-12-04T10:01:25.9485842Z Autotune Choices Stats: 2025-12-04T10:01:25.9487275Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_235", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9487798Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9488128Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9488683Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9489851Z triton_flex_attention_backward_235 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9491022Z triton_flex_attention_backward_236 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.9492184Z triton_flex_attention_backward_237 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9493379Z triton_flex_attention_backward_234 0.0133 ms 92.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9494538Z triton_flex_attention_backward_239 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9495732Z triton_flex_attention_backward_240 0.0142 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9496922Z triton_flex_attention_backward_238 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9498116Z triton_flex_attention_backward_241 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9499276Z triton_flex_attention_backward_243 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9500430Z triton_flex_attention_backward_242 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9500677Z SingleProcess AUTOTUNE benchmarking takes 0.6634 seconds and 2.1409 seconds precompiling for 13 choices 2025-12-04T10:01:25.9500809Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9500880Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9500942Z unimplemented [] 2025-12-04T10:01:25.9501085Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9501275Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9502464Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9502746Z graph_break [] 2025-12-04T10:01:25.9502886Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9502959Z Autotune Choices Stats: 2025-12-04T10:01:25.9504370Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_251", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9504666Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9504924Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9505232Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9506361Z triton_flex_attention_251 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9507547Z triton_flex_attention_252 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9508691Z triton_flex_attention_249 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9509854Z triton_flex_attention_250 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9510972Z triton_flex_attention_247 0.0123 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9512124Z triton_flex_attention_248 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9512373Z 
SingleProcess AUTOTUNE benchmarking takes 0.2920 seconds and 1.2982 seconds precompiling for 6 choices 2025-12-04T10:01:25.9512442Z Autotune Choices Stats: 2025-12-04T10:01:25.9513923Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_254", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9514388Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9514720Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9515274Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9516455Z triton_flex_attention_backward_254 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9517619Z triton_flex_attention_backward_255 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9518820Z triton_flex_attention_backward_253 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9519983Z triton_flex_attention_backward_256 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9521180Z triton_flex_attention_backward_258 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9522368Z triton_flex_attention_backward_257 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9523557Z triton_flex_attention_backward_259 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9524714Z triton_flex_attention_backward_260 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:25.9525885Z triton_flex_attention_backward_262 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9527049Z triton_flex_attention_backward_265 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9527336Z SingleProcess AUTOTUNE benchmarking takes 0.6632 seconds and 2.1866 seconds precompiling for 13 choices 2025-12-04T10:01:25.9527471Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9527538Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9527606Z unimplemented [] 2025-12-04T10:01:25.9527706Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9527890Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9529116Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9529177Z graph_break [] 2025-12-04T10:01:25.9529310Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9529376Z Autotune Choices Stats: 2025-12-04T10:01:25.9530834Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_270", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9531118Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9531343Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9531653Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9532783Z triton_flex_attention_270 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9533900Z triton_flex_attention_271 0.0091 ms 89.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9535025Z triton_flex_attention_266 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9536182Z triton_flex_attention_268 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9537311Z triton_flex_attention_269 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9538480Z triton_flex_attention_267 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9538725Z SingleProcess AUTOTUNE benchmarking takes 0.2911 seconds and 1.2890 seconds precompiling for 6 choices 2025-12-04T10:01:25.9538837Z Autotune Choices Stats: 2025-12-04T10:01:25.9540312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_272", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9540751Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9541086Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 
1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9541642Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9542819Z triton_flex_attention_backward_272 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9544015Z triton_flex_attention_backward_273 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9545181Z triton_flex_attention_backward_274 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9546377Z 
triton_flex_attention_backward_275 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9547695Z triton_flex_attention_backward_277 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9548901Z triton_flex_attention_backward_276 0.0142 ms 86.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9550058Z triton_flex_attention_backward_278 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9551220Z triton_flex_attention_backward_279 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9552386Z triton_flex_attention_backward_284 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9553576Z triton_flex_attention_backward_280 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9553832Z SingleProcess AUTOTUNE benchmarking takes 0.6639 seconds and 2.2941 seconds precompiling for 13 choices 2025-12-04T10:01:25.9553997Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.9554072Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9554140Z unimplemented [] 2025-12-04T10:01:25.9554242Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9554429Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9555904Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9555971Z graph_break [] 2025-12-04T10:01:25.9556107Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9556340Z Autotune Choices Stats: 2025-12-04T10:01:25.9557766Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_289", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9558010Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9558229Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9558569Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9559714Z triton_flex_attention_289 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9560842Z triton_flex_attention_290 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9562020Z triton_flex_attention_285 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9563146Z triton_flex_attention_287 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9564337Z triton_flex_attention_288 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9565489Z triton_flex_attention_286 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9565786Z SingleProcess AUTOTUNE benchmarking takes 0.2910 seconds and 1.2965 seconds precompiling for 6 choices 2025-12-04T10:01:25.9565856Z Autotune Choices Stats: 2025-12-04T10:01:25.9567305Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_292", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9567742Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9568078Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9568650Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9569831Z triton_flex_attention_backward_292 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9571052Z triton_flex_attention_backward_291 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9572224Z triton_flex_attention_backward_293 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9573416Z triton_flex_attention_backward_294 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9574644Z triton_flex_attention_backward_295 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9575806Z triton_flex_attention_backward_296 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9576977Z triton_flex_attention_backward_298 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9578144Z triton_flex_attention_backward_297 0.0135 ms 83.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9579333Z triton_flex_attention_backward_300 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9580500Z 
triton_flex_attention_backward_303 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9580787Z SingleProcess AUTOTUNE benchmarking takes 0.6662 seconds and 2.2381 seconds precompiling for 13 choices 2025-12-04T10:01:25.9580924Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9580997Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9581070Z unimplemented [] 2025-12-04T10:01:25.9581173Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9581360Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9582584Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9582691Z graph_break [] 2025-12-04T10:01:25.9582825Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9582894Z Autotune Choices Stats: 2025-12-04T10:01:25.9584300Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_309", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9584554Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9584777Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9585095Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9586227Z triton_flex_attention_309 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9587484Z triton_flex_attention_308 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.9588621Z triton_flex_attention_307 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9589778Z triton_flex_attention_306 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9590932Z triton_flex_attention_304 0.0122 ms 67.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9592087Z triton_flex_attention_305 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9592347Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.2799 seconds precompiling for 6 choices 2025-12-04T10:01:25.9592414Z Autotune Choices Stats: 2025-12-04T10:01:25.9593867Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_310", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9594304Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9594637Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9595202Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9596419Z triton_flex_attention_backward_310 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9597588Z triton_flex_attention_backward_311 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9598792Z triton_flex_attention_backward_312 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9599980Z triton_flex_attention_backward_313 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9601207Z 
triton_flex_attention_backward_315 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9602374Z triton_flex_attention_backward_314 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9603549Z triton_flex_attention_backward_316 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9604747Z triton_flex_attention_backward_317 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9605912Z triton_flex_attention_backward_319 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9607107Z triton_flex_attention_backward_322 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9607349Z SingleProcess AUTOTUNE benchmarking takes 0.6659 seconds and 2.2084 seconds precompiling for 13 choices 2025-12-04T10:01:25.9607481Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9607549Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9607616Z unimplemented [] 2025-12-04T10:01:25.9607717Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9607971Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9609164Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9609225Z graph_break [] 2025-12-04T10:01:25.9609357Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9609424Z Autotune Choices Stats: 2025-12-04T10:01:25.9610830Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_327", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9611082Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9611301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9611619Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9612793Z triton_flex_attention_327 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9613923Z triton_flex_attention_328 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9615084Z triton_flex_attention_325 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9616214Z triton_flex_attention_326 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9617413Z triton_flex_attention_323 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9618536Z triton_flex_attention_324 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9618789Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.3999 seconds precompiling for 6 choices 2025-12-04T10:01:25.9618867Z Autotune Choices Stats: 2025-12-04T10:01:25.9620317Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_329", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9620762Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9621106Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9621702Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9622884Z triton_flex_attention_backward_329 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9624082Z triton_flex_attention_backward_330 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9625283Z triton_flex_attention_backward_331 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9626486Z triton_flex_attention_backward_332 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9627713Z triton_flex_attention_backward_334 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9628881Z triton_flex_attention_backward_333 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9630042Z triton_flex_attention_backward_335 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9631242Z triton_flex_attention_backward_336 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9632398Z triton_flex_attention_backward_338 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9633597Z triton_flex_attention_backward_337 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9633936Z SingleProcess AUTOTUNE 
benchmarking takes 0.6646 seconds and 2.2443 seconds precompiling for 13 choices 2025-12-04T10:01:25.9634074Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9634144Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9634218Z unimplemented [] 2025-12-04T10:01:25.9634319Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9634508Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9635702Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9635767Z graph_break [] 2025-12-04T10:01:25.9635905Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9635973Z Autotune Choices Stats: 2025-12-04T10:01:25.9637376Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_346", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9637623Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9637847Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9638167Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9639355Z triton_flex_attention_346 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9640481Z triton_flex_attention_347 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9641640Z triton_flex_attention_345 0.0112 ms 72.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9642787Z 
triton_flex_attention_342 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9643947Z triton_flex_attention_344 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9645065Z triton_flex_attention_343 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9645316Z SingleProcess AUTOTUNE benchmarking takes 0.2918 seconds and 1.3055 seconds precompiling for 6 choices 2025-12-04T10:01:25.9645386Z Autotune Choices Stats: 2025-12-04T10:01:25.9646841Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_348", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9647311Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9647650Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9648205Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9649419Z triton_flex_attention_backward_348 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9650583Z triton_flex_attention_backward_349 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9651817Z triton_flex_attention_backward_350 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9652985Z triton_flex_attention_backward_351 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9654156Z triton_flex_attention_backward_353 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9655578Z triton_flex_attention_backward_355 0.0133 
ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9656825Z triton_flex_attention_backward_352 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9657992Z triton_flex_attention_backward_354 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9659215Z triton_flex_attention_backward_360 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9660421Z triton_flex_attention_backward_356 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9660716Z SingleProcess AUTOTUNE benchmarking takes 0.6644 seconds and 2.3797 seconds precompiling for 13 choices 2025-12-04T10:01:25.9660852Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9660924Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9660992Z unimplemented [] 2025-12-04T10:01:25.9661095Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9661279Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9662465Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9662530Z graph_break [] 2025-12-04T10:01:25.9662662Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9662729Z Autotune Choices Stats: 2025-12-04T10:01:25.9664132Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_365", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9664378Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9664635Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9664955Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9666089Z triton_flex_attention_365 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9667331Z triton_flex_attention_366 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9668498Z triton_flex_attention_364 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9669661Z triton_flex_attention_361 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9670790Z triton_flex_attention_363 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9671910Z triton_flex_attention_362 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9672159Z SingleProcess AUTOTUNE benchmarking takes 0.2926 seconds and 1.2859 seconds precompiling for 6 choices 2025-12-04T10:01:25.9672225Z Autotune Choices Stats: 2025-12-04T10:01:25.9673708Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_369", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9674139Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9674474Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9675068Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9676256Z triton_flex_attention_backward_369 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9677460Z triton_flex_attention_backward_367 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9678658Z triton_flex_attention_backward_368 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9679820Z triton_flex_attention_backward_370 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9680979Z triton_flex_attention_backward_371 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9682175Z triton_flex_attention_backward_372 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9683334Z triton_flex_attention_backward_373 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9684533Z triton_flex_attention_backward_374 0.0133 ms 84.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9685717Z triton_flex_attention_backward_376 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9686925Z triton_flex_attention_backward_379 0.0145 ms 77.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9687167Z SingleProcess AUTOTUNE benchmarking takes 0.6653 seconds and 2.2670 seconds precompiling for 13 choices 2025-12-04T10:01:25.9687303Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9687372Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9687436Z unimplemented [] 2025-12-04T10:01:25.9687544Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9687726Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9688920Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 27), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 8), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 5), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9688981Z graph_break [] 2025-12-04T10:01:25.9689114Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9689181Z Autotune Choices Stats: 2025-12-04T10:01:25.9690626Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_384", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9690878Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9691095Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9691416Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9692579Z triton_flex_attention_384 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9693775Z triton_flex_attention_385 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9694961Z triton_flex_attention_380 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9696087Z triton_flex_attention_382 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9697218Z triton_flex_attention_383 
0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9698343Z triton_flex_attention_381 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9698597Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:25.9698664Z Autotune Choices Stats: 2025-12-04T10:01:25.9700164Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_387", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9700636Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9700976Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9701536Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9702746Z triton_flex_attention_backward_387 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9703942Z triton_flex_attention_backward_388 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9705113Z triton_flex_attention_backward_389 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9706272Z triton_flex_attention_backward_386 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9707519Z triton_flex_attention_backward_390 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9708733Z triton_flex_attention_backward_391 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9709938Z triton_flex_attention_backward_392 0.0133 ms 92.3% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9711103Z triton_flex_attention_backward_393 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9712323Z triton_flex_attention_backward_395 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9713493Z triton_flex_attention_backward_398 0.0153 ms 80.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9713741Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2505 seconds precompiling for 13 choices 2025-12-04T10:01:25.9713875Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9713941Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9714004Z unimplemented [] 2025-12-04T10:01:25.9714110Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9714293Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9715476Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9715538Z graph_break [] 2025-12-04T10:01:25.9715668Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9715735Z Autotune Choices Stats: 2025-12-04T10:01:25.9717178Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_403", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9717472Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9717686Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9718006Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9719145Z triton_flex_attention_403 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9720349Z triton_flex_attention_404 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9721473Z triton_flex_attention_399 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9722608Z triton_flex_attention_401 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9723729Z triton_flex_attention_402 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9724887Z triton_flex_attention_400 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9725133Z SingleProcess AUTOTUNE benchmarking takes 0.2925 seconds and 1.2817 seconds precompiling for 6 choices 2025-12-04T10:01:25.9725199Z Autotune Choices Stats: 2025-12-04T10:01:25.9726648Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_407", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.012128000147640705, "best_triton_pos": 0} 2025-12-04T10:01:25.9727129Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9727456Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9728050Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9729282Z triton_flex_attention_backward_407 0.0121 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9730454Z triton_flex_attention_backward_405 0.0123 ms 98.7% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9731617Z triton_flex_attention_backward_406 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9732778Z triton_flex_attention_backward_408 0.0123 ms 98.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9733967Z triton_flex_attention_backward_410 0.0133 ms 91.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9735127Z triton_flex_attention_backward_409 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9736334Z triton_flex_attention_backward_411 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9737525Z triton_flex_attention_backward_412 0.0133 ms 91.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9738716Z triton_flex_attention_backward_414 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9739891Z triton_flex_attention_backward_417 0.0143 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9740137Z SingleProcess AUTOTUNE benchmarking takes 0.6636 seconds and 2.4723 seconds precompiling for 13 choices 2025-12-04T10:01:25.9740274Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9740342Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9740406Z unimplemented [] 2025-12-04T10:01:25.9740512Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9740695Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9741928Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9741990Z graph_break [] 2025-12-04T10:01:25.9742118Z 
----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9742190Z Autotune Choices Stats: 2025-12-04T10:01:25.9743591Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_422", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9743887Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9744107Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9744425Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9745586Z triton_flex_attention_422 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9746746Z triton_flex_attention_423 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9747912Z triton_flex_attention_418 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9749054Z triton_flex_attention_420 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9750182Z triton_flex_attention_421 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9751341Z triton_flex_attention_419 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9751588Z SingleProcess AUTOTUNE benchmarking takes 0.2931 seconds and 1.3362 seconds precompiling for 6 choices 2025-12-04T10:01:25.9751690Z Autotune Choices Stats: 2025-12-04T10:01:25.9753136Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_425", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:25.9753573Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9753964Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9754533Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, 
torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9756001Z triton_flex_attention_backward_425 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9757187Z triton_flex_attention_backward_426 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9758357Z triton_flex_attention_backward_427 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9759592Z triton_flex_attention_backward_424 0.0132 ms 93.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9760768Z triton_flex_attention_backward_429 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9761990Z triton_flex_attention_backward_428 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9763217Z triton_flex_attention_backward_430 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9764434Z triton_flex_attention_backward_431 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9765596Z triton_flex_attention_backward_433 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9766773Z triton_flex_attention_backward_432 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9767016Z SingleProcess AUTOTUNE benchmarking takes 0.6652 seconds and 2.2612 seconds precompiling for 13 choices 2025-12-04T10:01:25.9767160Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9767229Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9767293Z unimplemented [] 2025-12-04T10:01:25.9767407Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9767591Z aot_autograd 
[('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9768816Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9768879Z graph_break [] 2025-12-04T10:01:25.9769043Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9769124Z Autotune Choices Stats: 2025-12-04T10:01:25.9770528Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_441", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9770776Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9771003Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9771391Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9772524Z 
triton_flex_attention_441 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9773647Z triton_flex_attention_442 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9774775Z triton_flex_attention_437 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9775901Z triton_flex_attention_439 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9777056Z triton_flex_attention_440 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9778195Z triton_flex_attention_438 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9778487Z SingleProcess AUTOTUNE benchmarking takes 0.2928 seconds and 1.3107 seconds precompiling for 6 choices 2025-12-04T10:01:25.9778560Z Autotune Choices Stats: 2025-12-04T10:01:25.9780033Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_444", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9780505Z 
AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9780832Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9781399Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9782574Z triton_flex_attention_backward_444 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9783745Z triton_flex_attention_backward_443 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9784949Z triton_flex_attention_backward_445 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9786106Z triton_flex_attention_backward_446 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9787385Z triton_flex_attention_backward_448 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9788585Z triton_flex_attention_backward_449 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.9789792Z triton_flex_attention_backward_447 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9790950Z triton_flex_attention_backward_450 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9792107Z triton_flex_attention_backward_452 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9793274Z triton_flex_attention_backward_455 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9793523Z SingleProcess AUTOTUNE benchmarking takes 0.6641 seconds and 2.2679 seconds precompiling for 13 choices 2025-12-04T10:01:25.9793691Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9793761Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9793825Z unimplemented [] 2025-12-04T10:01:25.9793934Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9794119Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9795310Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9795423Z graph_break [] 2025-12-04T10:01:25.9795552Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9795637Z Autotune Choices Stats: 2025-12-04T10:01:25.9797079Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_460", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9797360Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9797584Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9797902Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9799038Z triton_flex_attention_460 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9800162Z triton_flex_attention_461 0.0092 ms 88.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9801292Z triton_flex_attention_458 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9802457Z triton_flex_attention_459 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9803568Z triton_flex_attention_456 0.0113 ms 72.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9804732Z triton_flex_attention_457 0.0143 ms 57.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9804985Z SingleProcess AUTOTUNE benchmarking takes 0.2915 seconds and 1.2978 seconds precompiling for 6 choices 2025-12-04T10:01:25.9805052Z Autotune Choices Stats: 
2025-12-04T10:01:25.9806523Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_463", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9806995Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9807323Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9807894Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9809070Z triton_flex_attention_backward_463 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.9810246Z triton_flex_attention_backward_464 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9811465Z triton_flex_attention_backward_465 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9812628Z triton_flex_attention_backward_467 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9813837Z triton_flex_attention_backward_462 0.0132 ms 85.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9815028Z triton_flex_attention_backward_469 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9816224Z triton_flex_attention_backward_468 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9817393Z triton_flex_attention_backward_466 0.0143 ms 78.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9818552Z triton_flex_attention_backward_471 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9819743Z triton_flex_attention_backward_474 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9819987Z SingleProcess AUTOTUNE benchmarking takes 0.6661 seconds and 2.3240 seconds precompiling for 13 choices 2025-12-04T10:01:25.9820121Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9820191Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9820254Z unimplemented [] 2025-12-04T10:01:25.9820407Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9820595Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9821783Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9821844Z graph_break [] 2025-12-04T10:01:25.9821972Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9822045Z Autotune Choices Stats: 2025-12-04T10:01:25.9823478Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_479", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.008191999979317188, "best_triton_pos": 0} 2025-12-04T10:01:25.9823764Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9823982Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9824302Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9825440Z triton_flex_attention_479 0.0082 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9826564Z triton_flex_attention_480 0.0082 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9827823Z triton_flex_attention_477 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9828959Z triton_flex_attention_478 0.0102 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9830152Z triton_flex_attention_475 0.0113 ms 72.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:25.9831280Z triton_flex_attention_476 0.0133 ms 61.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9831591Z SingleProcess AUTOTUNE benchmarking takes 0.2922 seconds and 1.3007 seconds precompiling for 6 choices 2025-12-04T10:01:25.9831665Z Autotune Choices Stats: 2025-12-04T10:01:25.9833113Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_482", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011168000288307667, "best_triton_pos": 0} 2025-12-04T10:01:25.9833553Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9833883Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9834451Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9835622Z triton_flex_attention_backward_482 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9836838Z triton_flex_attention_backward_483 0.0112 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9838002Z triton_flex_attention_backward_481 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9839201Z triton_flex_attention_backward_484 0.0113 ms 99.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9840400Z triton_flex_attention_backward_485 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9841590Z triton_flex_attention_backward_486 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9842753Z triton_flex_attention_backward_487 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:25.9843917Z triton_flex_attention_backward_488 0.0123 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9845105Z triton_flex_attention_backward_490 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9846263Z triton_flex_attention_backward_493 0.0133 ms 83.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9846551Z SingleProcess AUTOTUNE benchmarking takes 0.6645 seconds and 2.3331 seconds precompiling for 13 choices 2025-12-04T10:01:25.9846686Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9846755Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9846815Z 
unimplemented [] 2025-12-04T10:01:25.9846923Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9847113Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9848302Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9848434Z graph_break [] 2025-12-04T10:01:25.9848564Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9848635Z Autotune Choices Stats: 2025-12-04T10:01:25.9850039Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_498", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.9850289Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9850512Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9850832Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9851969Z triton_flex_attention_498 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9853091Z triton_flex_attention_499 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9854251Z triton_flex_attention_496 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9855640Z triton_flex_attention_494 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9856885Z triton_flex_attention_497 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9858064Z triton_flex_attention_495 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9858403Z SingleProcess AUTOTUNE benchmarking takes 0.2945 seconds and 1.2849 seconds precompiling for 6 choices 2025-12-04T10:01:25.9858477Z Autotune Choices Stats: 2025-12-04T10:01:25.9859935Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_501", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:25.9860382Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9860713Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9861277Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9862521Z triton_flex_attention_backward_501 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9863693Z triton_flex_attention_backward_502 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.9864892Z triton_flex_attention_backward_503 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9866079Z triton_flex_attention_backward_500 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9867382Z triton_flex_attention_backward_505 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9868555Z triton_flex_attention_backward_504 0.0153 ms 87.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9869726Z triton_flex_attention_backward_506 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9870890Z triton_flex_attention_backward_507 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9872091Z triton_flex_attention_backward_508 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9873255Z triton_flex_attention_backward_509 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9873547Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2710 seconds precompiling for 13 choices 2025-12-04T10:01:25.9873688Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9873758Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9873821Z unimplemented [] 2025-12-04T10:01:25.9873934Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9874121Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9875339Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9875442Z graph_break [] 2025-12-04T10:01:25.9875573Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9875648Z Autotune Choices Stats: 2025-12-04T10:01:25.9877070Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_517", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.9877325Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9877547Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9877873Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9879004Z triton_flex_attention_517 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9880161Z triton_flex_attention_518 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9881410Z triton_flex_attention_515 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9882571Z triton_flex_attention_513 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9883733Z triton_flex_attention_514 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9884910Z triton_flex_attention_516 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9885159Z 
SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3334 seconds precompiling for 6 choices 2025-12-04T10:01:25.9885234Z Autotune Choices Stats: 2025-12-04T10:01:25.9886682Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_520", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.9887128Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9887457Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9888026Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9889233Z triton_flex_attention_backward_520 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9890458Z triton_flex_attention_backward_521 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9891622Z triton_flex_attention_backward_522 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9892881Z triton_flex_attention_backward_519 0.0153 ms 93.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9894045Z triton_flex_attention_backward_523 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9895217Z triton_flex_attention_backward_524 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9896380Z triton_flex_attention_backward_526 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9897574Z triton_flex_attention_backward_525 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 
2025-12-04T10:01:25.9898739Z triton_flex_attention_backward_528 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9899935Z triton_flex_attention_backward_527 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9900185Z SingleProcess AUTOTUNE benchmarking takes 0.6685 seconds and 2.2506 seconds precompiling for 13 choices 2025-12-04T10:01:25.9900320Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9900435Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9900540Z unimplemented [] 2025-12-04T10:01:25.9900652Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9900838Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9902020Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), 
('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9902088Z graph_break [] 2025-12-04T10:01:25.9902213Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9902289Z Autotune Choices Stats: 2025-12-04T10:01:25.9903693Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_536", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:25.9903941Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9904159Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9904481Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9905659Z triton_flex_attention_536 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9906786Z triton_flex_attention_537 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9908018Z triton_flex_attention_534 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9909182Z triton_flex_attention_532 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9910343Z triton_flex_attention_535 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9911475Z triton_flex_attention_533 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9911720Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3133 seconds precompiling for 6 choices 2025-12-04T10:01:25.9911793Z Autotune Choices Stats: 2025-12-04T10:01:25.9913238Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_541", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013344000093638897, "best_triton_pos": 0} 2025-12-04T10:01:25.9913682Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9914044Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9914611Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9915782Z triton_flex_attention_backward_541 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9916992Z triton_flex_attention_backward_538 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9918184Z triton_flex_attention_backward_539 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9919394Z 
triton_flex_attention_backward_540 0.0143 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9920562Z triton_flex_attention_backward_543 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9921722Z triton_flex_attention_backward_545 0.0154 ms 86.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9922919Z triton_flex_attention_backward_544 0.0155 ms 86.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9924078Z triton_flex_attention_backward_542 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9925296Z triton_flex_attention_backward_547 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9926483Z triton_flex_attention_backward_546 0.0174 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9926773Z SingleProcess AUTOTUNE benchmarking takes 0.6689 seconds and 2.3413 seconds precompiling for 13 choices 2025-12-04T10:01:25.9926903Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:25.9926978Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9927044Z unimplemented [] 2025-12-04T10:01:25.9927153Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9927336Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9928513Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9928584Z graph_break [] 2025-12-04T10:01:25.9928712Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9928783Z Autotune Choices Stats: 2025-12-04T10:01:25.9930185Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_555", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9930442Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9930662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9931021Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9932161Z triton_flex_attention_555 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9933328Z triton_flex_attention_556 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9934451Z triton_flex_attention_553 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9935639Z triton_flex_attention_551 0.0144 ms 78.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9936763Z triton_flex_attention_554 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9937897Z triton_flex_attention_552 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9938139Z SingleProcess AUTOTUNE benchmarking takes 0.2940 seconds and 1.3263 seconds precompiling for 6 choices 2025-12-04T10:01:25.9938211Z Autotune Choices Stats: 2025-12-04T10:01:25.9939654Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_557", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.9940149Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9940486Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9941061Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9942270Z triton_flex_attention_backward_557 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9943475Z triton_flex_attention_backward_558 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9944690Z triton_flex_attention_backward_559 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9945861Z triton_flex_attention_backward_560 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9947034Z triton_flex_attention_backward_562 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9948264Z triton_flex_attention_backward_561 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9949463Z triton_flex_attention_backward_563 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9950625Z triton_flex_attention_backward_564 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9951987Z triton_flex_attention_backward_566 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9953229Z 
triton_flex_attention_backward_565 0.0173 ms 82.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9953476Z SingleProcess AUTOTUNE benchmarking takes 0.6683 seconds and 2.1930 seconds precompiling for 13 choices 2025-12-04T10:01:25.9953605Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9953685Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9953748Z unimplemented [] 2025-12-04T10:01:25.9953856Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9954044Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9955435Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9955557Z graph_break [] 2025-12-04T10:01:25.9955705Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9955779Z Autotune Choices Stats: 2025-12-04T10:01:25.9957198Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_574", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9957538Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9957766Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9958089Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9959228Z triton_flex_attention_574 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9960407Z triton_flex_attention_575 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:25.9961574Z triton_flex_attention_572 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9962756Z triton_flex_attention_570 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9963881Z triton_flex_attention_573 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9965008Z triton_flex_attention_571 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9965256Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.3286 seconds precompiling for 6 choices 2025-12-04T10:01:25.9965328Z Autotune Choices Stats: 2025-12-04T10:01:25.9966847Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_579", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.014303999952971935, "best_triton_pos": 0} 2025-12-04T10:01:25.9967294Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9967662Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9968228Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9969432Z triton_flex_attention_backward_579 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9970637Z triton_flex_attention_backward_576 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9971797Z triton_flex_attention_backward_577 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9972977Z triton_flex_attention_backward_578 0.0143 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9974145Z 
triton_flex_attention_backward_581 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9975351Z triton_flex_attention_backward_583 0.0154 ms 93.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:25.9976525Z triton_flex_attention_backward_580 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9977723Z triton_flex_attention_backward_582 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9978921Z triton_flex_attention_backward_585 0.0164 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9980122Z triton_flex_attention_backward_588 0.0174 ms 82.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:25.9980367Z SingleProcess AUTOTUNE benchmarking takes 0.6677 seconds and 2.2260 seconds precompiling for 13 choices 2025-12-04T10:01:25.9980499Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:25.9980573Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:25.9980633Z unimplemented [] 2025-12-04T10:01:25.9980738Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:25.9980921Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:25.9982104Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:25.9982184Z graph_break [] 2025-12-04T10:01:25.9982317Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:25.9982390Z Autotune Choices Stats: 2025-12-04T10:01:25.9983816Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_593", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:25.9984063Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9984284Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9984633Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9985792Z triton_flex_attention_593 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9986944Z triton_flex_attention_594 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9988194Z triton_flex_attention_591 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:25.9989315Z triton_flex_attention_589 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9990440Z triton_flex_attention_592 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:25.9991567Z triton_flex_attention_590 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9991852Z SingleProcess AUTOTUNE benchmarking takes 0.2943 seconds and 1.2961 seconds precompiling for 6 choices 2025-12-04T10:01:25.9991926Z Autotune Choices Stats: 2025-12-04T10:01:25.9993366Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_595", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:25.9993848Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:25.9994180Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:25.9994746Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:25.9995950Z triton_flex_attention_backward_595 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:25.9997165Z triton_flex_attention_backward_596 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:25.9998334Z triton_flex_attention_backward_597 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:25.9999512Z triton_flex_attention_backward_598 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0000717Z triton_flex_attention_backward_600 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0001878Z triton_flex_attention_backward_601 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0003076Z triton_flex_attention_backward_602 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0004263Z triton_flex_attention_backward_599 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0005459Z triton_flex_attention_backward_604 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0006618Z triton_flex_attention_backward_603 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0006867Z SingleProcess AUTOTUNE 
benchmarking takes 0.6685 seconds and 2.3242 seconds precompiling for 13 choices 2025-12-04T10:01:26.0006999Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0007075Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0007137Z unimplemented [] 2025-12-04T10:01:26.0007237Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0007430Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0012677Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0012784Z graph_break [] 2025-12-04T10:01:26.0013008Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0013090Z Autotune Choices Stats: 2025-12-04T10:01:26.0014529Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_612", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0014836Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0015063Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0015380Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0016568Z triton_flex_attention_612 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0017727Z triton_flex_attention_613 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0018852Z triton_flex_attention_608 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0019981Z 
triton_flex_attention_610 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0021094Z triton_flex_attention_611 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0022250Z triton_flex_attention_609 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0022502Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3064 seconds precompiling for 6 choices 2025-12-04T10:01:26.0022582Z Autotune Choices Stats: 2025-12-04T10:01:26.0024021Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_616", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0024502Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0024851Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0025498Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0026681Z triton_flex_attention_backward_616 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0027944Z triton_flex_attention_backward_615 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0029105Z triton_flex_attention_backward_617 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0030302Z triton_flex_attention_backward_614 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0031457Z triton_flex_attention_backward_619 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0032656Z triton_flex_attention_backward_620 0.0154 ms 
86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0033816Z triton_flex_attention_backward_618 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0035035Z triton_flex_attention_backward_621 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0036205Z triton_flex_attention_backward_623 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0037362Z triton_flex_attention_backward_622 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0037619Z SingleProcess AUTOTUNE benchmarking takes 0.6701 seconds and 2.3164 seconds precompiling for 13 choices 2025-12-04T10:01:26.0037756Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0037835Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0037901Z unimplemented [] 2025-12-04T10:01:26.0038012Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0038212Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0039435Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0039519Z graph_break [] 2025-12-04T10:01:26.0039658Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0039729Z Autotune Choices Stats: 2025-12-04T10:01:26.0041138Z {"num_choices": 6, 
"num_triton_choices": 6, "best_kernel": "triton_flex_attention_631", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0041421Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0041649Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0041961Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0043169Z triton_flex_attention_631 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0044291Z triton_flex_attention_632 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0045421Z triton_flex_attention_629 0.0142 ms 79.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0046544Z triton_flex_attention_627 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0047703Z triton_flex_attention_630 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0048840Z triton_flex_attention_628 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0049120Z SingleProcess AUTOTUNE benchmarking takes 0.2935 seconds and 1.2948 seconds precompiling for 6 choices 2025-12-04T10:01:26.0049196Z Autotune Choices Stats: 2025-12-04T10:01:26.0050635Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_634", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:26.0051110Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0051477Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0052037Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0053216Z triton_flex_attention_backward_634 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0054379Z triton_flex_attention_backward_635 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0055823Z triton_flex_attention_backward_636 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0057083Z triton_flex_attention_backward_633 0.0144 ms 99.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, 
V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0058245Z triton_flex_attention_backward_638 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0059475Z triton_flex_attention_backward_640 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0060670Z triton_flex_attention_backward_637 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0061883Z triton_flex_attention_backward_639 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0063047Z triton_flex_attention_backward_642 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0064196Z triton_flex_attention_backward_641 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0064457Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.2589 seconds precompiling for 13 choices 2025-12-04T10:01:26.0064595Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0064675Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0064739Z unimplemented [] 2025-12-04T10:01:26.0064883Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0065081Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0066267Z inductor [('triton_bundler_save_kernel', 232), 
('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0066374Z graph_break [] 2025-12-04T10:01:26.0066504Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0066573Z Autotune Choices Stats: 2025-12-04T10:01:26.0068051Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_650", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0068300Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0068612Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0068931Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0070069Z triton_flex_attention_650 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0071195Z triton_flex_attention_651 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0072326Z triton_flex_attention_646 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0073482Z triton_flex_attention_648 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0074597Z triton_flex_attention_649 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0075762Z triton_flex_attention_647 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0076010Z SingleProcess AUTOTUNE benchmarking takes 0.2938 seconds and 1.3235 seconds precompiling for 6 choices 2025-12-04T10:01:26.0076085Z Autotune Choices Stats: 2025-12-04T10:01:26.0077557Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_653", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0078035Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 
1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0078365Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0078926Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0080112Z triton_flex_attention_backward_653 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0081284Z triton_flex_attention_backward_654 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0082466Z triton_flex_attention_backward_655 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0083626Z triton_flex_attention_backward_652 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0084815Z triton_flex_attention_backward_656 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0086002Z triton_flex_attention_backward_657 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0087210Z triton_flex_attention_backward_659 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, 
BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0088361Z triton_flex_attention_backward_658 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0089528Z triton_flex_attention_backward_661 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0090683Z triton_flex_attention_backward_660 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0090985Z SingleProcess AUTOTUNE benchmarking takes 0.6697 seconds and 2.4000 seconds precompiling for 13 choices 2025-12-04T10:01:26.0091132Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0091211Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0091276Z unimplemented [] 2025-12-04T10:01:26.0091378Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0091571Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0092794Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0092863Z graph_break [] 2025-12-04T10:01:26.0092989Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0093055Z Autotune Choices Stats: 2025-12-04T10:01:26.0094490Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_669", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 
2025-12-04T10:01:26.0094769Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0094995Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0095305Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0096453Z triton_flex_attention_669 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0097583Z triton_flex_attention_670 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0098704Z triton_flex_attention_665 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0099854Z triton_flex_attention_667 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0100984Z triton_flex_attention_668 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0102135Z triton_flex_attention_666 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0102379Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.2940 seconds precompiling for 6 choices 2025-12-04T10:01:26.0102486Z Autotune Choices Stats: 2025-12-04T10:01:26.0103991Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_673", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, 
BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0104441Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0104770Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0105336Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0106513Z triton_flex_attention_backward_673 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0107792Z triton_flex_attention_backward_674 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0108964Z triton_flex_attention_backward_671 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0110164Z triton_flex_attention_backward_672 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0111348Z triton_flex_attention_backward_676 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=3, num_warps=4 2025-12-04T10:01:26.0112541Z triton_flex_attention_backward_675 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0113700Z triton_flex_attention_backward_677 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0114859Z triton_flex_attention_backward_678 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0116014Z triton_flex_attention_backward_680 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0117216Z triton_flex_attention_backward_683 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0117467Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3879 seconds precompiling for 13 choices 2025-12-04T10:01:26.0117598Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0117708Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0117771Z unimplemented [] 2025-12-04T10:01:26.0117873Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0118063Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0119249Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0119327Z graph_break [] 2025-12-04T10:01:26.0119457Z ----------------------------- Captured stderr call ----------------------------- 
2025-12-04T10:01:26.0119564Z Autotune Choices Stats: 2025-12-04T10:01:26.0121014Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_689", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010367999784648418, "best_triton_pos": 0} 2025-12-04T10:01:26.0121259Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0121484Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0121803Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0122935Z triton_flex_attention_689 0.0104 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0124048Z triton_flex_attention_688 0.0113 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, 
QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0125209Z triton_flex_attention_684 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0126330Z triton_flex_attention_686 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0127488Z triton_flex_attention_687 0.0143 ms 72.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0128644Z triton_flex_attention_685 0.0164 ms 63.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, 
HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0128925Z SingleProcess AUTOTUNE benchmarking takes 0.2939 seconds and 1.3120 seconds precompiling for 6 choices 2025-12-04T10:01:26.0128992Z Autotune Choices Stats: 2025-12-04T10:01:26.0130432Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_690", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:26.0130877Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0131207Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0131759Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 
2025-12-04T10:01:26.0132930Z triton_flex_attention_backward_690 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0134160Z triton_flex_attention_backward_691 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0135337Z triton_flex_attention_backward_692 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0136549Z triton_flex_attention_backward_693 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0137781Z triton_flex_attention_backward_695 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0138943Z triton_flex_attention_backward_696 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0140115Z triton_flex_attention_backward_697 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0141281Z triton_flex_attention_backward_694 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0142483Z triton_flex_attention_backward_699 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0143642Z triton_flex_attention_backward_702 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0143933Z SingleProcess AUTOTUNE benchmarking takes 0.6688 seconds and 2.3417 seconds precompiling for 13 choices 2025-12-04T10:01:26.0144065Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0144146Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0144209Z unimplemented [] 2025-12-04T10:01:26.0144311Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0144502Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), 
('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0145725Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0145836Z graph_break [] 2025-12-04T10:01:26.0145969Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0146038Z Autotune Choices Stats: 2025-12-04T10:01:26.0147528Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_707", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0147795Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0148029Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0148344Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0149489Z triton_flex_attention_707 0.0113 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0150643Z triton_flex_attention_708 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0151772Z triton_flex_attention_705 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0152928Z triton_flex_attention_706 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:26.0154109Z triton_flex_attention_703 0.0144 ms 78.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0155480Z triton_flex_attention_704 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0155778Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.2972 seconds precompiling for 6 choices 2025-12-04T10:01:26.0155852Z Autotune Choices Stats: 2025-12-04T10:01:26.0157312Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_709", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:26.0157757Z AUTOTUNE 
flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0158088Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0158645Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0159902Z triton_flex_attention_backward_709 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0161070Z triton_flex_attention_backward_710 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0162287Z triton_flex_attention_backward_711 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0163492Z triton_flex_attention_backward_712 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0164693Z triton_flex_attention_backward_714 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0165854Z triton_flex_attention_backward_715 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=4, num_warps=4 2025-12-04T10:01:26.0167011Z triton_flex_attention_backward_713 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0168199Z triton_flex_attention_backward_716 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0169354Z triton_flex_attention_backward_721 0.0174 ms 82.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0170558Z triton_flex_attention_backward_717 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0170811Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.1509 seconds precompiling for 13 choices 2025-12-04T10:01:26.0170947Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0171023Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0171087Z unimplemented [] 2025-12-04T10:01:26.0171193Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0171458Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0172645Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 29), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 10), ('coordesc_tuning_bench', 7), ('async_compile_cache_hit', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0172712Z graph_break [] 2025-12-04T10:01:26.0172840Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0172909Z Autotune Choices Stats: 2025-12-04T10:01:26.0174319Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_726", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0174567Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0174794Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0175106Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0176248Z triton_flex_attention_726 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0177405Z triton_flex_attention_727 0.0123 ms 83.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0178567Z triton_flex_attention_722 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0179688Z triton_flex_attention_724 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0180885Z triton_flex_attention_725 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0182010Z triton_flex_attention_723 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0182263Z SingleProcess AUTOTUNE benchmarking takes 0.2941 seconds and 1.2814 seconds precompiling for 6 choices 2025-12-04T10:01:26.0182332Z Autotune Choices Stats: 
2025-12-04T10:01:26.0183777Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_729", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0184216Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0184558Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0185145Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0186331Z triton_flex_attention_backward_729 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:26.0187647Z triton_flex_attention_backward_730 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0188844Z triton_flex_attention_backward_731 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0190038Z triton_flex_attention_backward_733 0.0144 ms 92.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0191196Z triton_flex_attention_backward_728 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0192362Z triton_flex_attention_backward_735 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0193526Z triton_flex_attention_backward_732 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0194720Z triton_flex_attention_backward_734 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0195878Z triton_flex_attention_backward_737 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0197088Z triton_flex_attention_backward_740 0.0173 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0197343Z SingleProcess AUTOTUNE benchmarking takes 0.6682 seconds and 2.2393 seconds precompiling for 13 choices 2025-12-04T10:01:26.0197540Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0197615Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0197680Z unimplemented [] 2025-12-04T10:01:26.0197780Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0197974Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0199157Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), 
('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0199228Z graph_break [] 2025-12-04T10:01:26.0199358Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0199424Z Autotune Choices Stats: 2025-12-04T10:01:26.0200832Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_745", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0201076Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0201303Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0201614Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0202792Z triton_flex_attention_745 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0203914Z triton_flex_attention_746 0.0102 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0205071Z triton_flex_attention_743 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0206212Z triton_flex_attention_741 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0207366Z triton_flex_attention_744 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=1, num_warps=8 2025-12-04T10:01:26.0208485Z triton_flex_attention_742 0.0164 ms 62.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0208734Z SingleProcess AUTOTUNE benchmarking takes 0.2954 seconds and 1.3187 seconds precompiling for 6 choices 2025-12-04T10:01:26.0208802Z Autotune Choices Stats: 2025-12-04T10:01:26.0210244Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_750", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.01228800043463707, "best_triton_pos": 0} 2025-12-04T10:01:26.0210681Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0211050Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0211604Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0212814Z triton_flex_attention_backward_750 0.0123 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0213971Z triton_flex_attention_backward_748 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0215203Z triton_flex_attention_backward_749 0.0133 ms 92.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0216373Z triton_flex_attention_backward_753 0.0143 ms 85.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0217533Z triton_flex_attention_backward_747 0.0144 ms 85.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0218697Z triton_flex_attention_backward_752 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0219902Z triton_flex_attention_backward_754 0.0154 ms 80.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, 
num_stages=5, num_warps=4 2025-12-04T10:01:26.0221069Z triton_flex_attention_backward_751 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0222260Z triton_flex_attention_backward_756 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0223444Z triton_flex_attention_backward_759 0.0164 ms 75.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0223730Z SingleProcess AUTOTUNE benchmarking takes 0.6710 seconds and 2.3823 seconds precompiling for 13 choices 2025-12-04T10:01:26.0223861Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0223929Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0223996Z 
unimplemented [] 2025-12-04T10:01:26.0224096Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0224287Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0225474Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0225546Z graph_break [] 2025-12-04T10:01:26.0225684Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0225752Z Autotune Choices Stats: 2025-12-04T10:01:26.0227148Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_765", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010304000228643417, "best_triton_pos": 0} 2025-12-04T10:01:26.0227435Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0227700Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0228011Z dtypes: torch.float16, torch.float16, 
torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0229182Z triton_flex_attention_765 0.0103 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0230342Z triton_flex_attention_764 0.0113 ms 91.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0231496Z triton_flex_attention_762 0.0133 ms 77.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0232646Z triton_flex_attention_760 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0233773Z triton_flex_attention_763 0.0143 ms 71.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0234907Z triton_flex_attention_761 0.0154 ms 67.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0235155Z SingleProcess AUTOTUNE benchmarking takes 0.2951 seconds and 1.3301 seconds precompiling for 6 choices 2025-12-04T10:01:26.0235221Z Autotune Choices Stats: 2025-12-04T10:01:26.0236716Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_767", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, 
V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0237157Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0237494Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0238091Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0239274Z triton_flex_attention_backward_767 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0240527Z triton_flex_attention_backward_769 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 
2025-12-04T10:01:26.0241724Z triton_flex_attention_backward_766 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0242897Z triton_flex_attention_backward_768 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0244055Z triton_flex_attention_backward_771 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0245257Z triton_flex_attention_backward_772 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0246420Z triton_flex_attention_backward_770 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0247618Z triton_flex_attention_backward_773 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0248814Z triton_flex_attention_backward_775 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0250026Z triton_flex_attention_backward_778 0.0174 ms 76.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0250275Z SingleProcess AUTOTUNE benchmarking takes 0.6693 seconds and 2.2444 seconds precompiling for 13 choices 2025-12-04T10:01:26.0250407Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0250477Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0250551Z unimplemented [] 2025-12-04T10:01:26.0250653Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0250843Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0252036Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0252097Z graph_break [] 2025-12-04T10:01:26.0252231Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0252297Z Autotune Choices Stats: 2025-12-04T10:01:26.0253749Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_783", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0253998Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0254221Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0254535Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0255996Z triton_flex_attention_783 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0257188Z triton_flex_attention_784 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0258364Z triton_flex_attention_779 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, 
FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0259490Z triton_flex_attention_781 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0260626Z triton_flex_attention_782 0.0154 ms 73.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0261750Z triton_flex_attention_780 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0262002Z 
SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3189 seconds precompiling for 6 choices 2025-12-04T10:01:26.0262068Z Autotune Choices Stats: 2025-12-04T10:01:26.0263567Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_786", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0264054Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0264390Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0264956Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0266173Z triton_flex_attention_backward_786 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0267449Z triton_flex_attention_backward_787 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0268609Z triton_flex_attention_backward_788 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0269776Z triton_flex_attention_backward_785 0.0145 ms 92.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0270930Z triton_flex_attention_backward_790 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0272138Z triton_flex_attention_backward_791 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0273304Z triton_flex_attention_backward_792 0.0155 ms 86.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0274508Z triton_flex_attention_backward_789 0.0164 ms 81.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 
2025-12-04T10:01:26.0275740Z triton_flex_attention_backward_794 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0276896Z triton_flex_attention_backward_797 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0277151Z SingleProcess AUTOTUNE benchmarking takes 0.6703 seconds and 2.2711 seconds precompiling for 13 choices 2025-12-04T10:01:26.0277282Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0277350Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0277431Z unimplemented [] 2025-12-04T10:01:26.0277539Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0277733Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0278924Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), 
('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0278990Z graph_break [] 2025-12-04T10:01:26.0279120Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0279186Z Autotune Choices Stats: 2025-12-04T10:01:26.0280623Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_803", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0280909Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0281135Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0281449Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0282591Z triton_flex_attention_803 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, 
WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0285249Z triton_flex_attention_802 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0288421Z triton_flex_attention_800 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0290808Z triton_flex_attention_798 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0293128Z triton_flex_attention_801 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0295520Z triton_flex_attention_799 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0297026Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.2928 seconds precompiling for 6 choices 2025-12-04T10:01:26.0297482Z Autotune Choices Stats: 2025-12-04T10:01:26.0299385Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_806", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0301478Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0302320Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 
1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0303323Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0305185Z triton_flex_attention_backward_806 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0307659Z triton_flex_attention_backward_805 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0310050Z triton_flex_attention_backward_807 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0312441Z 
triton_flex_attention_backward_804 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0315027Z triton_flex_attention_backward_809 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0317427Z triton_flex_attention_backward_810 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0319860Z triton_flex_attention_backward_811 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0322294Z triton_flex_attention_backward_808 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0324716Z triton_flex_attention_backward_812 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0327111Z triton_flex_attention_backward_813 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0328615Z SingleProcess AUTOTUNE benchmarking takes 0.6698 seconds and 2.2839 seconds precompiling for 13 choices 2025-12-04T10:01:26.0329081Z ----------------------------- Captured 
stdout call ----------------------------- 2025-12-04T10:01:26.0329369Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0329565Z unimplemented [] 2025-12-04T10:01:26.0329775Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0330146Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0331636Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0332950Z graph_break [] 2025-12-04T10:01:26.0333189Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0333469Z Autotune Choices Stats: 2025-12-04T10:01:26.0334999Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_821", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0336757Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0337298Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], 
[256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0337917Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0339488Z triton_flex_attention_821 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0341855Z triton_flex_attention_822 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0344169Z triton_flex_attention_817 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0346477Z triton_flex_attention_819 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0348885Z triton_flex_attention_820 0.0154 ms 66.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0351232Z triton_flex_attention_818 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0352680Z SingleProcess AUTOTUNE benchmarking takes 0.2934 seconds and 1.3176 seconds precompiling for 6 choices 2025-12-04T10:01:26.0353121Z Autotune Choices Stats: 2025-12-04T10:01:26.0354683Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_825", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, 
ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0356918Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0357830Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0358848Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0360671Z triton_flex_attention_backward_825 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0363080Z triton_flex_attention_backward_824 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0365467Z triton_flex_attention_backward_826 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0367918Z triton_flex_attention_backward_823 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0370312Z triton_flex_attention_backward_828 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0372749Z triton_flex_attention_backward_827 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0375177Z triton_flex_attention_backward_829 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0377607Z triton_flex_attention_backward_830 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0379995Z triton_flex_attention_backward_832 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0382383Z 
triton_flex_attention_backward_835 0.0164 ms 81.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0383877Z SingleProcess AUTOTUNE benchmarking takes 0.6673 seconds and 2.2875 seconds precompiling for 13 choices 2025-12-04T10:01:26.0384337Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0384623Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0384819Z unimplemented [] 2025-12-04T10:01:26.0385021Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0385386Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0386895Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0388273Z graph_break [] 2025-12-04T10:01:26.0388494Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0388818Z Autotune Choices Stats: 2025-12-04T10:01:26.0390342Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_840", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, 
FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0392052Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0392591Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0393270Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0394800Z triton_flex_attention_840 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0397136Z triton_flex_attention_841 0.0113 ms 90.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 
2025-12-04T10:01:26.0399450Z triton_flex_attention_836 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0401766Z triton_flex_attention_838 0.0133 ms 76.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0404125Z triton_flex_attention_839 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0406474Z triton_flex_attention_837 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0407946Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3350 seconds precompiling for 6 choices 2025-12-04T10:01:26.0408333Z Autotune Choices Stats: 2025-12-04T10:01:26.0409923Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_843", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0411898Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0412741Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0413710Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0415523Z triton_flex_attention_backward_843 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0417934Z triton_flex_attention_backward_844 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0420399Z triton_flex_attention_backward_845 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0422803Z triton_flex_attention_backward_842 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0425240Z 
triton_flex_attention_backward_847 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0428056Z triton_flex_attention_backward_846 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0430584Z triton_flex_attention_backward_848 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0432972Z triton_flex_attention_backward_849 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, 
SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0435369Z triton_flex_attention_backward_851 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0437763Z triton_flex_attention_backward_850 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0439242Z SingleProcess AUTOTUNE benchmarking takes 0.6676 seconds and 2.3506 seconds precompiling for 13 choices 2025-12-04T10:01:26.0439700Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0440016Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0440212Z unimplemented [] 2025-12-04T10:01:26.0440413Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0440771Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0442225Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 26), 
('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 7), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 4), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0443580Z graph_break [] 2025-12-04T10:01:26.0443813Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0444087Z Autotune Choices Stats: 2025-12-04T10:01:26.0445651Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_859", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0447401Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0447938Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0448561Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0450095Z triton_flex_attention_859 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, 
QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0452414Z triton_flex_attention_860 0.0122 ms 92.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0454724Z triton_flex_attention_857 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0457554Z triton_flex_attention_858 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0459893Z triton_flex_attention_855 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0462262Z triton_flex_attention_856 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0463708Z SingleProcess AUTOTUNE benchmarking takes 0.2946 seconds and 1.3085 seconds precompiling for 6 choices 2025-12-04T10:01:26.0464091Z Autotune Choices Stats: 2025-12-04T10:01:26.0465713Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_862", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:26.0467807Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0468651Z strides: [16384, 8192, 64, 1], [16384, 
8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0469639Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0471466Z triton_flex_attention_backward_862 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0473881Z triton_flex_attention_backward_863 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0476324Z triton_flex_attention_backward_864 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0478724Z triton_flex_attention_backward_861 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0481202Z triton_flex_attention_backward_865 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0483638Z triton_flex_attention_backward_866 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0486088Z triton_flex_attention_backward_868 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, 
IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0488484Z triton_flex_attention_backward_867 0.0154 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0490873Z triton_flex_attention_backward_870 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0493293Z triton_flex_attention_backward_869 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0494783Z SingleProcess AUTOTUNE 
benchmarking takes 0.6670 seconds and 2.3594 seconds precompiling for 13 choices 2025-12-04T10:01:26.0495245Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0495526Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0495716Z unimplemented [] 2025-12-04T10:01:26.0495923Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0496340Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0497786Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0499106Z graph_break [] 2025-12-04T10:01:26.0499335Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0499608Z Autotune Choices Stats: 2025-12-04T10:01:26.0501156Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_878", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0502904Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 
1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0503445Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0504059Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0505592Z triton_flex_attention_878 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0508044Z triton_flex_attention_879 0.0123 ms 91.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0510400Z triton_flex_attention_874 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0512719Z 
triton_flex_attention_876 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0515085Z triton_flex_attention_877 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0517410Z triton_flex_attention_875 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0518928Z SingleProcess AUTOTUNE benchmarking takes 0.2950 seconds and 1.3095 seconds precompiling for 6 choices 2025-12-04T10:01:26.0519320Z Autotune Choices Stats: 2025-12-04T10:01:26.0520879Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_880", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4", "best_time": 0.014336000196635723, "best_triton_pos": 0} 2025-12-04T10:01:26.0522824Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0523660Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0524632Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0526461Z triton_flex_attention_backward_880 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0528904Z triton_flex_attention_backward_881 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0531312Z triton_flex_attention_backward_882 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0534029Z triton_flex_attention_backward_883 0.0143 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0536456Z triton_flex_attention_backward_885 0.0154 ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0538881Z triton_flex_attention_backward_886 0.0154 
ms 93.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0541298Z triton_flex_attention_backward_884 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0543696Z triton_flex_attention_backward_887 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0546095Z triton_flex_attention_backward_889 0.0164 ms 87.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, 
SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0548622Z triton_flex_attention_backward_888 0.0174 ms 82.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0550165Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.3839 seconds precompiling for 13 choices 2025-12-04T10:01:26.0550625Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0550913Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0551111Z unimplemented [] 2025-12-04T10:01:26.0551312Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0551681Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0553147Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 28), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('benchmarking.InductorBenchmarker.benchmark', 9), ('async_compile_cache_hit', 6), ('coordesc_tuning_bench', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0554529Z graph_break [] 2025-12-04T10:01:26.0554788Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0555068Z Autotune Choices Stats: 2025-12-04T10:01:26.0556890Z 
{"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_897", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.011264000087976456, "best_triton_pos": 0} 2025-12-04T10:01:26.0558608Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0559158Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0559764Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0561295Z triton_flex_attention_897 0.0113 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0563608Z triton_flex_attention_898 0.0121 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, 
SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0565994Z triton_flex_attention_893 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0568324Z triton_flex_attention_895 0.0133 ms 84.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0570705Z triton_flex_attention_896 0.0143 ms 78.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0573076Z triton_flex_attention_894 0.0164 ms 68.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, 
PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0574562Z SingleProcess AUTOTUNE benchmarking takes 0.2949 seconds and 1.3269 seconds precompiling for 6 choices 2025-12-04T10:01:26.0574951Z Autotune Choices Stats: 2025-12-04T10:01:26.0576503Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_902", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0578459Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0579301Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0580271Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0582141Z triton_flex_attention_backward_902 0.0133 ms 100.0% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0584552Z triton_flex_attention_backward_900 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0586983Z triton_flex_attention_backward_901 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0589497Z triton_flex_attention_backward_904 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0591925Z triton_flex_attention_backward_899 0.0145 ms 91.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0594327Z triton_flex_attention_backward_905 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0596735Z triton_flex_attention_backward_906 0.0162 ms 82.1% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0599127Z triton_flex_attention_backward_903 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0601547Z triton_flex_attention_backward_908 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0603944Z triton_flex_attention_backward_907 0.0174 ms 76.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0605456Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3006 seconds precompiling for 13 choices 2025-12-04T10:01:26.0605919Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0606197Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0606407Z unimplemented [] 2025-12-04T10:01:26.0606615Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0606978Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0608472Z inductor 
[('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0609818Z graph_break [] 2025-12-04T10:01:26.0610049Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0610335Z Autotune Choices Stats: 2025-12-04T10:01:26.0611868Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_916", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.010239999741315842, "best_triton_pos": 0} 2025-12-04T10:01:26.0613580Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0614125Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0614741Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0616282Z triton_flex_attention_916 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", 
GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0618659Z triton_flex_attention_917 0.0102 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0621181Z triton_flex_attention_914 0.0133 ms 76.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0623573Z triton_flex_attention_912 0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0625915Z triton_flex_attention_915 
0.0143 ms 71.4% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0628340Z triton_flex_attention_913 0.0164 ms 62.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0629784Z SingleProcess AUTOTUNE benchmarking takes 0.2942 seconds and 1.3079 seconds precompiling for 6 choices 2025-12-04T10:01:26.0630171Z Autotune Choices Stats: 2025-12-04T10:01:26.0631738Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": "triton_flex_attention_backward_919", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", "best_time": 0.013407999649643898, "best_triton_pos": 0} 2025-12-04T10:01:26.0633685Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 
1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0634533Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0635504Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0637380Z triton_flex_attention_backward_919 0.0134 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0639779Z triton_flex_attention_backward_918 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0642200Z triton_flex_attention_backward_920 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, 
OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0644681Z triton_flex_attention_backward_921 0.0143 ms 93.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0647073Z triton_flex_attention_backward_923 0.0154 ms 87.3% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0649458Z triton_flex_attention_backward_922 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0651849Z triton_flex_attention_backward_924 0.0164 ms 81.8% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0654279Z triton_flex_attention_backward_925 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0656965Z triton_flex_attention_backward_927 0.0164 ms 81.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0659434Z triton_flex_attention_backward_926 0.0174 ms 77.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0660913Z SingleProcess AUTOTUNE benchmarking takes 0.6696 seconds and 2.2834 seconds precompiling for 13 choices 2025-12-04T10:01:26.0661371Z ----------------------------- Captured stdout call ----------------------------- 2025-12-04T10:01:26.0661656Z frames [('total', 2), ('ok', 2)] 2025-12-04T10:01:26.0661912Z unimplemented [] 2025-12-04T10:01:26.0662164Z stats [('calls_captured', 105), ('unique_graphs', 2)] 2025-12-04T10:01:26.0662532Z aot_autograd [('total', 2), ('autograd_cache_miss', 2), ('autograd_cache_saved', 2), ('ok', 2)] 2025-12-04T10:01:26.0663987Z inductor [('triton_bundler_save_kernel', 232), ('async_compile_cache_miss', 35), ('benchmarking.InductorBenchmarker.benchmark_gpu', 25), ('select_algorithm_num_precompiles', 19), ('pattern_matcher_nodes', 16), ('pattern_matcher_count', 14), ('async_compile_cache_hit', 6), ('benchmarking.InductorBenchmarker.benchmark', 6), ('fxgraph_cache_miss', 3), ('triton_bundler_save_static_autotuner', 3), ('coordesc_tuning_bench', 3), ('select_algorithm_precompile', 2), ('select_algorithm_autotune', 2)] 2025-12-04T10:01:26.0665293Z graph_break [] 2025-12-04T10:01:26.0665525Z ----------------------------- Captured stderr call ----------------------------- 2025-12-04T10:01:26.0665820Z Autotune Choices Stats: 2025-12-04T10:01:26.0667409Z {"num_choices": 6, "num_triton_choices": 6, "best_kernel": "triton_flex_attention_936", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4", 
"best_time": 0.011103999800980091, "best_triton_pos": 0} 2025-12-04T10:01:26.0669125Z AUTOTUNE flex_attention(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0669661Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0670279Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0671886Z triton_flex_attention_936 0.0111 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0674219Z triton_flex_attention_935 0.0113 ms 98.6% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=64, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0676584Z triton_flex_attention_931 0.0143 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0678929Z triton_flex_attention_933 0.0143 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=8 2025-12-04T10:01:26.0681288Z triton_flex_attention_934 0.0143 ms 77.5% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=8 2025-12-04T10:01:26.0683611Z triton_flex_attention_932 0.0164 ms 67.8% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M=128, BLOCK_N=128, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, USE_TMA=False, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0685076Z SingleProcess AUTOTUNE benchmarking takes 0.2936 seconds and 1.3084 seconds precompiling for 6 choices 2025-12-04T10:01:26.0685466Z Autotune Choices Stats: 2025-12-04T10:01:26.0687022Z {"num_choices": 13, "num_triton_choices": 13, "best_kernel": 
"triton_flex_attention_backward_940", "best_kernel_desc": "BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION=\"'tf32'\", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4", "best_time": 0.013311999849975109, "best_triton_pos": 0} 2025-12-04T10:01:26.0688973Z AUTOTUNE flex_attention_backward(1x2x128x64, 1x2x128x64, 1x2x128x64, 1x2x128, 1x2x128, 1x2x128x64, 1x2x128x64, 1x2x128x64, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1, 1x1x1, 1x1x1x1) 2025-12-04T10:01:26.0689860Z strides: [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [256, 128, 1], [256, 128, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [16384, 8192, 64, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1], [1, 1, 1], [1, 1, 1, 1] 2025-12-04T10:01:26.0690828Z dtypes: torch.float16, torch.float16, torch.float16, torch.float32, torch.float32, torch.float16, torch.float16, torch.float16, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32, torch.int32 2025-12-04T10:01:26.0692665Z triton_flex_attention_backward_940 0.0133 ms 100.0% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0695104Z triton_flex_attention_backward_938 0.0143 ms 92.9% 
BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0697525Z triton_flex_attention_backward_939 0.0143 ms 92.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0699943Z triton_flex_attention_backward_937 0.0153 ms 87.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0702341Z triton_flex_attention_backward_942 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, 
SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0704742Z triton_flex_attention_backward_944 0.0154 ms 86.7% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=5, num_warps=4 2025-12-04T10:01:26.0707191Z triton_flex_attention_backward_941 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=1, num_warps=4 2025-12-04T10:01:26.0709655Z triton_flex_attention_backward_943 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=4, num_warps=4 2025-12-04T10:01:26.0712091Z triton_flex_attention_backward_946 0.0164 ms 81.2% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, 
OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=3, num_warps=4 2025-12-04T10:01:26.0714520Z triton_flex_attention_backward_949 0.0164 ms 80.9% BLOCKS_ARE_CONTIGUOUS=False, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, FLOAT32_PRECISION="'tf32'", GQA_SHARED_HEADS=1, HAS_FULL_BLOCKS=True, IS_DIVISIBLE=True, OUTPUT_LOGSUMEXP=True, OUTPUT_MAX=False, PRESCALE_QK=False, QK_HEAD_DIM=64, QK_HEAD_DIM_ROUNDED=64, ROWS_GUARANTEED_SAFE=False, SAFE_HEAD_DIM=True, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, V_HEAD_DIM=64, V_HEAD_DIM_ROUNDED=64, WRITE_DQ=True, num_stages=2, num_warps=4 2025-12-04T10:01:26.0716038Z SingleProcess AUTOTUNE benchmarking takes 0.6679 seconds and 2.3077 seconds precompiling for 13 choices 2025-12-04T10:01:26.0716810Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8842d0c0a55c3e44.xml - 2025-12-04T10:01:26.0717444Z =========================== short test summary info ============================ 2025-12-04T10:01:26.0718217Z FAILED [8.5319s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpb0gwypls/flex_attention_configs.json was not created 2025-12-04T10:01:26.0718837Z 2025-12-04T10:01:26.0718979Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0719436Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0719773Z 2025-12-04T10:01:26.0719938Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0720724Z FAILED [8.4669s] 
inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp8wnpqbue/flex_attention_configs.json was not created 2025-12-04T10:01:26.0721337Z 2025-12-04T10:01:26.0721475Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0721941Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0722271Z 2025-12-04T10:01:26.0722433Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0723205Z FAILED [8.3721s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp_npbvc1j/flex_attention_configs.json was not created 2025-12-04T10:01:26.0723816Z 2025-12-04T10:01:26.0723948Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0724394Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0724726Z 2025-12-04T10:01:26.0724881Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0725736Z FAILED [8.5757s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpk9u0gh1c/flex_attention_configs.json was not created 2025-12-04T10:01:26.0726360Z 2025-12-04T10:01:26.0726491Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0726938Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0727261Z 2025-12-04T10:01:26.0727416Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0728228Z FAILED [8.4485s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: 
False is not true : Log file /tmp/tmpa7c6r43z/flex_attention_configs.json was not created 2025-12-04T10:01:26.0728844Z 2025-12-04T10:01:26.0728968Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0729415Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0729745Z 2025-12-04T10:01:26.0729902Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0730677Z FAILED [8.5375s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp2rqaxu_5/flex_attention_configs.json was not created 2025-12-04T10:01:26.0731302Z 2025-12-04T10:01:26.0731430Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0731948Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0732275Z 2025-12-04T10:01:26.0732427Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0733355Z FAILED [8.6499s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpty_oqglf/flex_attention_configs.json was not created 2025-12-04T10:01:26.0733966Z 2025-12-04T10:01:26.0734089Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0734536Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0734859Z 2025-12-04T10:01:26.0735019Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0735786Z FAILED [8.5761s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp46qub0vx/flex_attention_configs.json was not created 
2025-12-04T10:01:26.0736411Z 2025-12-04T10:01:26.0736533Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0736982Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0737304Z 2025-12-04T10:01:26.0737461Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0738217Z FAILED [8.5494s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp7_xq874k/flex_attention_configs.json was not created 2025-12-04T10:01:26.0738824Z 2025-12-04T10:01:26.0738944Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0739384Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0739710Z 2025-12-04T10:01:26.0739868Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0740631Z FAILED [8.5124s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp93xtstbc/flex_attention_configs.json was not created 2025-12-04T10:01:26.0741244Z 2025-12-04T10:01:26.0741410Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0741858Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0742179Z 2025-12-04T10:01:26.0742338Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0743114Z FAILED [8.7232s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp62tcpu7h/flex_attention_configs.json was not created 2025-12-04T10:01:26.0743767Z 2025-12-04T10:01:26.0743891Z To execute this test, run the following from the 
base repo dir: 2025-12-04T10:01:26.0744338Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0744658Z 2025-12-04T10:01:26.0744815Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0745590Z FAILED [8.6751s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpe81b4lib/flex_attention_configs.json was not created 2025-12-04T10:01:26.0746198Z 2025-12-04T10:01:26.0746328Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0746769Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0747098Z 2025-12-04T10:01:26.0747296Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0748160Z FAILED [8.5914s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp5am0ftj2/flex_attention_configs.json was not created 2025-12-04T10:01:26.0748777Z 2025-12-04T10:01:26.0748899Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0749353Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0749686Z 2025-12-04T10:01:26.0749837Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0750603Z FAILED [8.6361s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmplcc40_74/flex_attention_configs.json was not created 2025-12-04T10:01:26.0751216Z 2025-12-04T10:01:26.0751343Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0751784Z python test/inductor/test_flex_attention.py 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0752118Z 2025-12-04T10:01:26.0752271Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0753047Z FAILED [8.9408s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpm3w3a3p6/flex_attention_configs.json was not created 2025-12-04T10:01:26.0753660Z 2025-12-04T10:01:26.0753789Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0754233Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0754564Z 2025-12-04T10:01:26.0754717Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0755761Z FAILED [8.3559s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxg4d7rlo/flex_attention_configs.json was not created 2025-12-04T10:01:26.0756384Z 2025-12-04T10:01:26.0756517Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0756959Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0757293Z 2025-12-04T10:01:26.0757558Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0758347Z FAILED [8.5935s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp51t4iifl/flex_attention_configs.json was not created 2025-12-04T10:01:26.0758970Z 2025-12-04T10:01:26.0759098Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0759543Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0759925Z 
2025-12-04T10:01:26.0760084Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0760857Z FAILED [8.6195s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp5bnad5h5/flex_attention_configs.json was not created 2025-12-04T10:01:26.0761470Z 2025-12-04T10:01:26.0761599Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0762049Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0762371Z 2025-12-04T10:01:26.0762524Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0763310Z FAILED [8.5897s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpnnpp2jxf/flex_attention_configs.json was not created 2025-12-04T10:01:26.0763929Z 2025-12-04T10:01:26.0764111Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0764603Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0764936Z 2025-12-04T10:01:26.0765097Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0765874Z FAILED [8.8642s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp9e0yqvfi/flex_attention_configs.json was not created 2025-12-04T10:01:26.0766489Z 2025-12-04T10:01:26.0766608Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0767053Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0767371Z 2025-12-04T10:01:26.0767524Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 
2025-12-04T10:01:26.0768295Z FAILED [8.9276s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpfs0cn7zn/flex_attention_configs.json was not created 2025-12-04T10:01:26.0768912Z 2025-12-04T10:01:26.0769033Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0769479Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0769802Z 2025-12-04T10:01:26.0769959Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0770716Z FAILED [8.7248s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp_nrs4kuo/flex_attention_configs.json was not created 2025-12-04T10:01:26.0771324Z 2025-12-04T10:01:26.0771445Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0771884Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0772208Z 2025-12-04T10:01:26.0772369Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0773145Z FAILED [8.6091s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpicavptze/flex_attention_configs.json was not created 2025-12-04T10:01:26.0773807Z 2025-12-04T10:01:26.0773935Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0774376Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0774697Z 2025-12-04T10:01:26.0774852Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0775615Z FAILED [8.8935s] 
inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpzc1j4inl/flex_attention_configs.json was not created 2025-12-04T10:01:26.0776270Z 2025-12-04T10:01:26.0776395Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0776842Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0777163Z 2025-12-04T10:01:26.0777322Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0778086Z FAILED [8.7682s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpexqsmcb8/flex_attention_configs.json was not created 2025-12-04T10:01:26.0778703Z 2025-12-04T10:01:26.0778826Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0779271Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0779593Z 2025-12-04T10:01:26.0779753Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0780605Z FAILED [8.6099s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpbvb4g54s/flex_attention_configs.json was not created 2025-12-04T10:01:26.0781217Z 2025-12-04T10:01:26.0781339Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0781784Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0782115Z 2025-12-04T10:01:26.0782268Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0783031Z FAILED [8.7270s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: 
False is not true : Log file /tmp/tmpq9_kmn9n/flex_attention_configs.json was not created 2025-12-04T10:01:26.0783638Z 2025-12-04T10:01:26.0783763Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0784207Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0784535Z 2025-12-04T10:01:26.0784689Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0785465Z FAILED [8.8851s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpr95ztumr/flex_attention_configs.json was not created 2025-12-04T10:01:26.0786075Z 2025-12-04T10:01:26.0786202Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0786642Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0786969Z 2025-12-04T10:01:26.0787123Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0788004Z FAILED [8.5053s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpi2u13ooi/flex_attention_configs.json was not created 2025-12-04T10:01:26.0788616Z 2025-12-04T10:01:26.0788746Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0789191Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0789523Z 2025-12-04T10:01:26.0789729Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0790502Z FAILED [8.8934s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpaff1vcq5/flex_attention_configs.json was not created 
2025-12-04T10:01:26.0791112Z 2025-12-04T10:01:26.0791240Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0791678Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0792051Z 2025-12-04T10:01:26.0792210Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0792976Z FAILED [8.9631s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpr5b4038i/flex_attention_configs.json was not created 2025-12-04T10:01:26.0793584Z 2025-12-04T10:01:26.0793713Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0794151Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0794482Z 2025-12-04T10:01:26.0794634Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0795407Z FAILED [8.7855s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpx8acg7t9/flex_attention_configs.json was not created 2025-12-04T10:01:26.0796058Z 2025-12-04T10:01:26.0796220Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0796675Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0796995Z 2025-12-04T10:01:26.0797152Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0797924Z FAILED [8.6276s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpbzs82aeu/flex_attention_configs.json was not created 2025-12-04T10:01:26.0798540Z 2025-12-04T10:01:26.0798665Z To execute this test, run the following from the 
base repo dir: 2025-12-04T10:01:26.0799105Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0799436Z 2025-12-04T10:01:26.0799588Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0800363Z FAILED [8.8305s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp0t2ys9b1/flex_attention_configs.json was not created 2025-12-04T10:01:26.0800978Z 2025-12-04T10:01:26.0801099Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0801544Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0801864Z 2025-12-04T10:01:26.0802016Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0802785Z FAILED [8.8622s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpv0q9k1ov/flex_attention_configs.json was not created 2025-12-04T10:01:26.0803401Z 2025-12-04T10:01:26.0803522Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0803978Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0804300Z 2025-12-04T10:01:26.0804457Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0805260Z FAILED [8.5916s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp49gtu2vp/flex_attention_configs.json was not created 2025-12-04T10:01:26.0805876Z 2025-12-04T10:01:26.0805998Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0806443Z python test/inductor/test_flex_attention.py 
TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0806763Z 2025-12-04T10:01:26.0806923Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0807688Z FAILED [8.5865s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpvow2h57n/flex_attention_configs.json was not created 2025-12-04T10:01:26.0808342Z 2025-12-04T10:01:26.0808465Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0808908Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0809238Z 2025-12-04T10:01:26.0809402Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0810164Z FAILED [9.0100s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpd6qqrq76/flex_attention_configs.json was not created 2025-12-04T10:01:26.0810777Z 2025-12-04T10:01:26.0810898Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0811348Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0811727Z 2025-12-04T10:01:26.0811917Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0812684Z FAILED [8.9863s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp80jrgwb0/flex_attention_configs.json was not created 2025-12-04T10:01:26.0813313Z 2025-12-04T10:01:26.0813439Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0813892Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0814218Z 
2025-12-04T10:01:26.0814382Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0815155Z FAILED [8.5117s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpdbsr3fy2/flex_attention_configs.json was not created 2025-12-04T10:01:26.0815774Z 2025-12-04T10:01:26.0815899Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0816350Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0816670Z 2025-12-04T10:01:26.0816842Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0817623Z FAILED [8.8083s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpfozg11dp/flex_attention_configs.json was not created 2025-12-04T10:01:26.0818233Z 2025-12-04T10:01:26.0818359Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0818805Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0819136Z 2025-12-04T10:01:26.0819290Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0820068Z FAILED [8.7107s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpokaaz2b9/flex_attention_configs.json was not created 2025-12-04T10:01:26.0820680Z 2025-12-04T10:01:26.0820818Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0821306Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0821642Z 2025-12-04T10:01:26.0821795Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 
2025-12-04T10:01:26.0822565Z FAILED [8.7130s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxa3ik349/flex_attention_configs.json was not created 2025-12-04T10:01:26.0823175Z 2025-12-04T10:01:26.0823304Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0823798Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0824126Z 2025-12-04T10:01:26.0824279Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0825052Z FAILED [8.7673s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpkgut0tc3/flex_attention_configs.json was not created 2025-12-04T10:01:26.0825659Z 2025-12-04T10:01:26.0825788Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0826227Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0826558Z 2025-12-04T10:01:26.0826713Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0827582Z FAILED [8.9727s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmp7g4b1x1n/flex_attention_configs.json was not created 2025-12-04T10:01:26.0828255Z 2025-12-04T10:01:26.0828386Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0828826Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0829156Z 2025-12-04T10:01:26.0829319Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0830092Z FAILED [8.8543s] 
inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpxbv7srfc/flex_attention_configs.json was not created 2025-12-04T10:01:26.0830701Z 2025-12-04T10:01:26.0830830Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0831274Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0831605Z 2025-12-04T10:01:26.0831762Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0832533Z FAILED [9.1040s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpehryl9m1/flex_attention_configs.json was not created 2025-12-04T10:01:26.0833144Z 2025-12-04T10:01:26.0833273Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0833713Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0834036Z 2025-12-04T10:01:26.0834188Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0834958Z FAILED [8.7166s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: False is not true : Log file /tmp/tmpenzy7uo9/flex_attention_configs.json was not created 2025-12-04T10:01:26.0835576Z 2025-12-04T10:01:26.0835700Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0836152Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0836479Z 2025-12-04T10:01:26.0836637Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0837470Z FAILED [8.5727s] inductor/test_flex_attention.py::TestLearnableBiasesCUDA::test_flex_attention_logging_cuda - AssertionError: 
False is not true : Log file /tmp/tmpf8ob5xno/flex_attention_configs.json was not created 2025-12-04T10:01:26.0838087Z 2025-12-04T10:01:26.0838209Z To execute this test, run the following from the base repo dir: 2025-12-04T10:01:26.0838659Z python test/inductor/test_flex_attention.py TestLearnableBiasesCUDA.test_flex_attention_logging_cuda 2025-12-04T10:01:26.0838982Z 2025-12-04T10:01:26.0839137Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:01:26.0839523Z =================== 49 failed, 1 passed in 438.70s (0:07:18) =================== 2025-12-04T10:01:26.0839728Z 2025-12-04T10:01:26.0840071Z FINISHED PRINTING LOG FILE of inductor/test_flex_attention 1/6 (test/test-reports/inductor.test_flex_attention_1.6_ddac0a72250f3643_.log) 2025-12-04T10:01:26.0840483Z 2025-12-04T10:01:26.0840711Z Finished inductor/test_flex_attention 1/6 ... [2025-12-04 10:01:21.921387][1358.555854062], took 7.46min 2025-12-04T10:01:26.0841486Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8842d0c0a55c3e44.xml 2025-12-04T10:01:26.0842114Z Uploading logs for 57120265563 to S3 2025-12-04T10:01:26.0842340Z Uploading artifacts took 1.27 seconds 2025-12-04T10:01:26.0842570Z inductor/test_flex_attention 1/6 failed! 2025-12-04T10:01:26.0847997Z Running inductor/test_flex_attention 3/6 ... [2025-12-04 10:01:23.593229][1360.227705217] 2025-12-04T10:01:26.0848422Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:01:26.0849443Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=3', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:01:23.593582] 2025-12-04T10:01:30.2798955Z 2025-12-04T10:01:30.2799987Z inductor/test_flex_attention 3/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_3.6_66a4e481ecf1862e_.log 2025-12-04T10:01:30.2800887Z Running 0 items in this shard: 2025-12-04T10:01:30.2801027Z 2025-12-04T10:01:30.2801265Z Finished inductor/test_flex_attention 3/6 ... [2025-12-04 10:01:30.279794][1366.914271318], took 0.11min 2025-12-04T10:01:30.2811012Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-df8e471be02986ee.xml 2025-12-04T10:01:30.3574620Z Running inductor/test_flex_attention 4/6 ... [2025-12-04 10:01:30.357203][1366.991682269] 2025-12-04T10:01:30.3575367Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:01:30.3578410Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=4', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:01:30.357514] 2025-12-04T10:01:36.9746737Z 2025-12-04T10:01:36.9748164Z inductor/test_flex_attention 4/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_4.6_e5d890032a85dd23_.log 2025-12-04T10:01:36.9749047Z Running 0 items in this shard: 2025-12-04T10:01:36.9749218Z 2025-12-04T10:01:36.9749503Z Finished inductor/test_flex_attention 4/6 ... [2025-12-04 10:01:36.974441][1373.608918565], took 0.11min 2025-12-04T10:01:36.9760249Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-de4e116bf43af918.xml 2025-12-04T10:01:37.0529977Z Running inductor/test_flex_attention 5/6 ... 
[2025-12-04 10:01:37.052729][1373.687208004] 2025-12-04T10:01:37.0530739Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:01:37.0533878Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=5', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:01:37.053040] 2025-12-04T10:01:43.6799895Z 2025-12-04T10:01:43.6800757Z inductor/test_flex_attention 5/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_5.6_2f9a5215a30f13bf_.log 2025-12-04T10:01:43.6801521Z Running 0 items in this shard: 2025-12-04T10:01:43.6801938Z 2025-12-04T10:01:43.6802264Z Finished inductor/test_flex_attention 5/6 ... [2025-12-04 10:01:43.679756][1380.314233053], took 0.11min 2025-12-04T10:01:43.6814023Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-2a982e23b7b97d08.xml 2025-12-04T10:01:43.7568725Z Running inductor/test_flex_attention 6/6 ... [2025-12-04 10:01:43.756618][1380.391097771] 2025-12-04T10:01:43.7569202Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:01:43.7572271Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_flex_attention.py', '--shard-id=6', '--num-shards=6', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:01:43.756931] 2025-12-04T10:03:19.8369815Z 2025-12-04T10:03:19.8370947Z inductor/test_flex_attention 6/6 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_flex_attention_6.6_5a3a1f34f66362bd_.log 2025-12-04T10:03:19.8409467Z Running 100 items in this shard: test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_flex_attention_stride_ordering_mode_paged_attention_permute_order3_shape1_cuda, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, 
test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16, test/inductor/test_flex_attention.py::TestFlexAttentionCUDA::test_return_aux_deprecation_warnings_cuda_float16 2025-12-04T10:03:19.8445597Z 2025-12-04T10:03:19.8445828Z Finished inductor/test_flex_attention 6/6 ... 
[2025-12-04 10:03:19.836972][1476.471443087], took 1.60min 2025-12-04T10:03:19.8446627Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-175ab23e93e8bbac.xml 2025-12-04T10:03:19.9275606Z Running test_privateuseone_python_backend 1/1 ... [2025-12-04 10:03:19.927289][1476.561765225] 2025-12-04T10:03:19.9276104Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:19.9278996Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_privateuseone_python_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:03:19.927611] 2025-12-04T10:03:22.7747115Z 2025-12-04T10:03:22.7748144Z test_privateuseone_python_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_privateuseone_python_backend_1.1_e6e06e88a3ef7cfe_.log 2025-12-04T10:03:22.7748965Z Running 0 items in this shard: 2025-12-04T10:03:22.7749144Z 2025-12-04T10:03:22.7749459Z Finished test_privateuseone_python_backend 1/1 ... [2025-12-04 10:03:22.774512][1479.408989214], took 0.05min 2025-12-04T10:03:22.7765602Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-c28ba098140a7833.xml 2025-12-04T10:03:22.8245944Z Running test_ci_sanity_check_fail 1/1 ... [2025-12-04 10:03:22.824357][1479.458835644] 2025-12-04T10:03:22.8246309Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:22.8249280Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_ci_sanity_check_fail.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:03:22.824673] 2025-12-04T10:03:25.6600709Z Finished test_ci_sanity_check_fail 1/1 ... [2025-12-04 10:03:25.659667][1482.294135056], took 0.05min 2025-12-04T10:03:25.6618466Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-09b2f72c46f7df3f.xml 2025-12-04T10:03:26.8573309Z Uploading logs for 57120265563 to S3 2025-12-04T10:03:27.0171882Z Uploading artifacts took 1.32 seconds 2025-12-04T10:03:27.0172232Z test_ci_sanity_check_fail 1/1 failed! 2025-12-04T10:03:27.0175032Z Running test_overrides 1/1 ... [2025-12-04 10:03:27.017291][1483.651766104] 2025-12-04T10:03:27.0175479Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:27.0179570Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_overrides.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:03:27.017687] 2025-12-04T10:03:32.6633087Z 2025-12-04T10:03:32.6633863Z test_overrides 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_overrides_1.1_02cd6dd5329f6857_.log 2025-12-04T10:03:32.6685939Z Running 250 items in this shard: test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, 
test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, 
test/test_overrides.py::TestTorchFunctionMode::test_get_mode_stack, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, 
test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_getitem_call, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, 
test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, 
test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_modes_handle_first, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, 
test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, 
test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_reentrant_mode_idiom, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, 
test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, 
test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api, test/test_overrides.py::TestTorchFunctionMode::test_torch_function_all_disabled_api 2025-12-04T10:03:32.6736731Z 2025-12-04T10:03:32.6736922Z Finished test_overrides 1/1 ... [2025-12-04 10:03:32.663438][1489.297911043], took 0.09min 2025-12-04T10:03:32.6737600Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_overrides/test_overrides-d70ad67a6515a66b.xml 2025-12-04T10:03:32.7035886Z Running inductor/test_max_autotune 1/1 ... [2025-12-04 10:03:32.703350][1489.33782719] 2025-12-04T10:03:32.7036329Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:32.7039696Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_max_autotune.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:03:32.703674] 2025-12-04T10:03:38.1080482Z 2025-12-04T10:03:38.1081369Z inductor/test_max_autotune 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_max_autotune_1.1_6e5671ef4b4366ba_.log 2025-12-04T10:03:38.1082016Z 2025-12-04T10:03:38.1082580Z Finished inductor/test_max_autotune 1/1 ... [2025-12-04 10:03:38.107782][1494.742258877], took 0.09min 2025-12-04T10:03:38.1102227Z Running doctests 1/1 ... [2025-12-04 10:03:38.110016][1494.744495395] 2025-12-04T10:03:38.6461094Z msg = Cannot scrape callname=Library.fallback in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=368. 2025-12-04T10:03:38.6461918Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.6462449Z Registers the function implementation as the fallback for the given key. 2025-12-04T10:03:38.6462758Z 2025-12-04T10:03:38.6462964Z This function only works for a library with global namespace ("_"). 2025-12-04T10:03:38.6463245Z 2025-12-04T10:03:38.6463325Z Args: 2025-12-04T10:03:38.6463694Z fn: function used as fallback for the given dispatch key or :func:`~fallthrough_kernel` 2025-12-04T10:03:38.6464136Z to register a fallthrough. 2025-12-04T10:03:38.6464915Z dispatch_key: dispatch key that the input function should be registered for. By default, it uses 2025-12-04T10:03:38.6465554Z the dispatch key that the library was created with. 2025-12-04T10:03:38.6466103Z with_keyset: flag controlling if the current dispatcher call keyset should be passed as the first argument 2025-12-04T10:03:38.6466805Z to :attr:`fn` when calling. This should be used to create the appropriate keyset for redispatch calls. 
2025-12-04T10:03:38.6467283Z 2025-12-04T10:03:38.6467392Z Example:: 2025-12-04T10:03:38.6467513Z 2025-12-04T10:03:38.6467608Z >>> my_lib = Library("_", "IMPL") 2025-12-04T10:03:38.6467910Z >>> def fallback_kernel(op, *args, **kwargs): 2025-12-04T10:03:38.6468225Z >>> # Handle all autocast ops generically 2025-12-04T10:03:38.6468442Z >>> # ... 2025-12-04T10:03:38.6468648Z >>> my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T10:03:38.6468887Z 2025-12-04T10:03:38.6469410Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 5, 1, 'my_lib.fallback(fallback_kernel, "Autocast")\n', 5, 7)) 2025-12-04T10:03:38.6469912Z 2025-12-04T10:03:38.6470000Z my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T10:03:38.6470220Z ^ 2025-12-04T10:03:38.6580194Z msg = Cannot scrape callname=register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=958. 2025-12-04T10:03:38.6580912Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.6581403Z Register a FakeTensor implementation ("fake impl") for this operator. 2025-12-04T10:03:38.6581701Z 2025-12-04T10:03:38.6581857Z Also sometimes known as a "meta kernel", "abstract impl". 2025-12-04T10:03:38.6582104Z 2025-12-04T10:03:38.6582309Z An "FakeTensor implementation" specifies the behavior of this operator on 2025-12-04T10:03:38.6582799Z Tensors that carry no data ("FakeTensor"). Given some input Tensors with 2025-12-04T10:03:38.6583287Z certain properties (sizes/strides/storage_offset/device), it specifies 2025-12-04T10:03:38.6583706Z what the properties of the output Tensors are. 2025-12-04T10:03:38.6583918Z 2025-12-04T10:03:38.6584132Z The FakeTensor implementation has the same signature as the operator. 2025-12-04T10:03:38.6584731Z It is run for both FakeTensors and meta tensors. 
To write a FakeTensor 2025-12-04T10:03:38.6585198Z implementation, assume that all Tensor inputs to the operator are 2025-12-04T10:03:38.6585644Z regular CPU/CUDA/Meta tensors, but they do not have storage, and 2025-12-04T10:03:38.6586097Z you are trying to return regular CPU/CUDA/Meta tensor(s) as output. 2025-12-04T10:03:38.6586571Z The FakeTensor implementation must consist of only PyTorch operations 2025-12-04T10:03:38.6587038Z (and may not directly access the storage or data of any input or 2025-12-04T10:03:38.6587604Z intermediate Tensors). 2025-12-04T10:03:38.6587771Z 2025-12-04T10:03:38.6587916Z This API may be used as a decorator (see examples). 2025-12-04T10:03:38.6588141Z 2025-12-04T10:03:38.6588275Z For a detailed guide on custom ops, please see 2025-12-04T10:03:38.6588635Z https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html 2025-12-04T10:03:38.6588867Z 2025-12-04T10:03:38.6588932Z Args: 2025-12-04T10:03:38.6589167Z op_name: Operator name (along with the overload) or OpOverload object. 2025-12-04T10:03:38.6589476Z func: Fake tensor implementation. 2025-12-04T10:03:38.6589775Z lib (Optional[Library]): Library to register the fake tensor to. 2025-12-04T10:03:38.6590117Z allow_override: Flag controlling if we want to override an 2025-12-04T10:03:38.6590433Z existing registered fake impl. This is by default off, 2025-12-04T10:03:38.6590750Z and will error you're trying to register a fake impl to 2025-12-04T10:03:38.6591202Z an operator that already has a fake impl. This also only 2025-12-04T10:03:38.6591506Z applies if the custom operator was not created via 2025-12-04T10:03:38.6591832Z torch.library.custom_op, as overriding and existing fake 2025-12-04T10:03:38.6592116Z impl is already allowed. 
2025-12-04T10:03:38.6592260Z 2025-12-04T10:03:38.6592327Z Examples: 2025-12-04T10:03:38.6592478Z >>> import torch 2025-12-04T10:03:38.6592661Z >>> import numpy as np 2025-12-04T10:03:38.6592856Z >>> from torch import Tensor 2025-12-04T10:03:38.6593045Z >>> 2025-12-04T10:03:38.6593264Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T10:03:38.6593625Z >>> @torch.library.custom_op("mylib::custom_linear", mutates_args=()) 2025-12-04T10:03:38.6593993Z >>> def custom_linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T10:03:38.6594348Z >>> raise NotImplementedError("Implementation goes here") 2025-12-04T10:03:38.6594616Z >>> 2025-12-04T10:03:38.6594820Z >>> @torch.library.register_fake("mylib::custom_linear") 2025-12-04T10:03:38.6595078Z >>> def _(x, weight, bias): 2025-12-04T10:03:38.6595284Z >>> assert x.dim() == 2 2025-12-04T10:03:38.6595488Z >>> assert weight.dim() == 2 2025-12-04T10:03:38.6595701Z >>> assert bias.dim() == 1 2025-12-04T10:03:38.6595929Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T10:03:38.6596180Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T10:03:38.6596419Z >>> assert x.device == weight.device 2025-12-04T10:03:38.6596633Z >>> 2025-12-04T10:03:38.6596806Z >>> return (x @ weight.t()) + bias 2025-12-04T10:03:38.6597016Z >>> 2025-12-04T10:03:38.6597208Z >>> with torch._subclasses.fake_tensor.FakeTensorMode(): 2025-12-04T10:03:38.6597477Z >>> x = torch.randn(2, 3) 2025-12-04T10:03:38.6597683Z >>> w = torch.randn(3, 3) 2025-12-04T10:03:38.6597882Z >>> b = torch.randn(3) 2025-12-04T10:03:38.6598138Z >>> y = torch.ops.mylib.custom_linear(x, w, b) 2025-12-04T10:03:38.6598365Z >>> 2025-12-04T10:03:38.6598513Z >>> assert y.shape == (2, 3) 2025-12-04T10:03:38.6598709Z >>> 2025-12-04T10:03:38.6598959Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T10:03:38.6599319Z >>> @torch.library.custom_op("mylib::custom_nonzero", mutates_args=()) 2025-12-04T10:03:38.6599630Z >>> def 
custom_nonzero(x: Tensor) -> Tensor: 2025-12-04T10:03:38.6599870Z >>> x_np = x.numpy(force=True) 2025-12-04T10:03:38.6600108Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T10:03:38.6600359Z >>> return torch.tensor(res, device=x.device) 2025-12-04T10:03:38.6600585Z >>> 2025-12-04T10:03:38.6600793Z >>> @torch.library.register_fake("mylib::custom_nonzero") 2025-12-04T10:03:38.6601108Z >>> def _(x): 2025-12-04T10:03:38.6601313Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T10:03:38.6601600Z >>> # Since we cannot peek at the data in an fake impl, 2025-12-04T10:03:38.6601896Z >>> # we use the ctx object to construct a new symint that 2025-12-04T10:03:38.6602165Z >>> # represents the data-dependent size. 2025-12-04T10:03:38.6602412Z >>> ctx = torch.library.get_ctx() 2025-12-04T10:03:38.6602640Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T10:03:38.6602853Z >>> shape = [nnz, x.dim()] 2025-12-04T10:03:38.6603095Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T10:03:38.6603337Z >>> return result 2025-12-04T10:03:38.6603509Z >>> 2025-12-04T10:03:38.6603718Z >>> from torch.fx.experimental.proxy_tensor import make_fx 2025-12-04T10:03:38.6603968Z >>> 2025-12-04T10:03:38.6604130Z >>> x = torch.tensor([0, 1, 2, 3, 4, 0]) 2025-12-04T10:03:38.6604536Z >>> trace = make_fx(torch.ops.mylib.custom_nonzero, tracing_mode="symbolic")(x) 2025-12-04T10:03:38.6604868Z >>> trace.print_readable() 2025-12-04T10:03:38.6605064Z >>> 2025-12-04T10:03:38.6605288Z >>> assert torch.allclose(trace(x), torch.ops.mylib.custom_nonzero(x)) 2025-12-04T10:03:38.6605521Z 2025-12-04T10:03:38.6605574Z 2025-12-04T10:03:38.6606008Z Original Error: IndentationError('expected an indented block after function definition on line 37', ('', 38, 1, '_._ = None\n', 38, 2)) 2025-12-04T10:03:38.6606432Z 2025-12-04T10:03:38.6606493Z _._ = None 2025-12-04T10:03:38.6606623Z ^ 2025-12-04T10:03:38.6700038Z msg = Cannot scrape callname=get_kernel in 
modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=1530. 2025-12-04T10:03:38.6700762Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.6701248Z Returns the computed kernel for a given operator and dispatch key. 2025-12-04T10:03:38.6701520Z 2025-12-04T10:03:38.6701712Z This function retrieves the kernel that would be executed for a given 2025-12-04T10:03:38.6702190Z operator and dispatch key combination. The returned SafeKernelFunction 2025-12-04T10:03:38.6702640Z can be used to call the kernel in a boxed fashion. The intended use 2025-12-04T10:03:38.6703062Z case for this function is to retrieve the original kernel for a given 2025-12-04T10:03:38.6703511Z dispatch key and then register another kernel to the same dispatch key 2025-12-04T10:03:38.6703936Z that calls into the original kernel for certain cases. 2025-12-04T10:03:38.6704167Z 2025-12-04T10:03:38.6704244Z Args: 2025-12-04T10:03:38.6704504Z op: Operator name (along with the overload) or OpOverload object 2025-12-04T10:03:38.6704965Z Can be a string (e.g., "aten::add.Tensor"), an OpOverload, or a CustomOpDef. 2025-12-04T10:03:38.6705482Z dispatch_key (str | torch.DispatchKey): The dispatch key to get the kernel for. 2025-12-04T10:03:38.6705954Z Can be a string (e.g., "CPU", "CUDA") or a DispatchKey enum value. 2025-12-04T10:03:38.6706227Z 2025-12-04T10:03:38.6706299Z Returns: 2025-12-04T10:03:38.6706616Z torch._C._SafeKernelFunction: A safe kernel function that can be used to 2025-12-04T10:03:38.6706994Z call the kernel. 2025-12-04T10:03:38.6707144Z 2025-12-04T10:03:38.6707416Z Raises: 2025-12-04T10:03:38.6707659Z RuntimeError: If the operator does not exist. 
2025-12-04T10:03:38.6707877Z 2025-12-04T10:03:38.6707956Z Example: 2025-12-04T10:03:38.6708160Z >>> # Get the CPU kernel for torch.add 2025-12-04T10:03:38.6708479Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", "CPU") 2025-12-04T10:03:38.6708744Z >>> 2025-12-04T10:03:38.6708924Z >>> # You can also use DispatchKey enum 2025-12-04T10:03:38.6709262Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", torch.DispatchKey.CPU) 2025-12-04T10:03:38.6709661Z >>> 2025-12-04T10:03:38.6709834Z >>> # Or use an OpOverload directly 2025-12-04T10:03:38.6710135Z >>> kernel = torch.library.get_kernel(torch.ops.aten.add.Tensor, "CPU") 2025-12-04T10:03:38.6710425Z >>> 2025-12-04T10:03:38.6710650Z >>> # Example: Using get_kernel in a custom op with conditional dispatch 2025-12-04T10:03:38.6710977Z >>> # Get the original kernel for torch.sin 2025-12-04T10:03:38.6711294Z >>> original_sin_kernel = torch.library.get_kernel("aten::sin", "CPU") 2025-12-04T10:03:38.6711578Z >>> 2025-12-04T10:03:38.6711804Z >>> # If input has negative values, use original sin, otherwise return zeros 2025-12-04T10:03:38.6712142Z >>> def conditional_sin_impl(dispatch_keys, x): 2025-12-04T10:03:38.6712388Z >>> if (x < 0).any(): 2025-12-04T10:03:38.6712638Z >>> return original_sin_kernel.call_boxed(dispatch_keys, x) 2025-12-04T10:03:38.6712980Z >>> else: 2025-12-04T10:03:38.6713221Z >>> return torch.zeros_like(x) 2025-12-04T10:03:38.6713426Z >>> 2025-12-04T10:03:38.6713606Z >>> lib = torch.library.Library("aten", "IMPL") 2025-12-04T10:03:38.6713948Z >>> # with_keyset=True so the first argument to the impl is the current DispatchKeySet 2025-12-04T10:03:38.6714330Z >>> which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T10:03:38.6714655Z >>> lib.impl("sin", conditional_sin_impl, "CPU", with_keyset=True) 2025-12-04T10:03:38.6714916Z >>> 2025-12-04T10:03:38.6715071Z >>> # Test the conditional behavior 2025-12-04T10:03:38.6715293Z >>> x_positive = torch.tensor([1.0, 2.0]) 
2025-12-04T10:03:38.6715526Z >>> x_mixed = torch.tensor([-1.0, 2.0]) 2025-12-04T10:03:38.6715760Z >>> torch.sin(x_positive) 2025-12-04T10:03:38.6715954Z tensor([0., 0.]) 2025-12-04T10:03:38.6716132Z >>> torch.sin(x_mixed) 2025-12-04T10:03:38.6716324Z tensor([-0.8415, 0.9093]) 2025-12-04T10:03:38.6716501Z 2025-12-04T10:03:38.6716901Z Original Error: SyntaxError('invalid syntax', ('', 23, 7, 'which needs to be the first argument to ``kernel.call_boxed``\n', 23, 12)) 2025-12-04T10:03:38.6717317Z 2025-12-04T10:03:38.6717440Z which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T10:03:38.6717698Z ^ 2025-12-04T10:03:38.9192335Z msg = Cannot scrape callname=is_available in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=70. 2025-12-04T10:03:38.9193193Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.9193727Z Check if the current accelerator is available at runtime: it was build, all the 2025-12-04T10:03:38.9194227Z required drivers are available and at least one device is visible. 2025-12-04T10:03:38.9194637Z See :ref:`accelerator` for details. 2025-12-04T10:03:38.9194862Z 2025-12-04T10:03:38.9194950Z Returns: 2025-12-04T10:03:38.9195303Z bool: A boolean indicating if there is an available :ref:`accelerator`. 2025-12-04T10:03:38.9195658Z 2025-12-04T10:03:38.9195900Z .. note:: This API delegates to the device-specific version of `is_available`. 2025-12-04T10:03:38.9196427Z On CUDA, when the environment variable ``PYTORCH_NVML_BASED_CUDA_CHECK=1`` is set, 2025-12-04T10:03:38.9197247Z this function will NOT poison fork. Otherwise, it will. For more details, see 2025-12-04T10:03:38.9197701Z :ref:`multiprocessing-poison-fork-note`. 2025-12-04T10:03:38.9197926Z 2025-12-04T10:03:38.9198010Z Example:: 2025-12-04T10:03:38.9198138Z 2025-12-04T10:03:38.9198343Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 
2025-12-04T10:03:38.9198662Z 2025-12-04T10:03:38.9199113Z Original Error: SyntaxError('invalid syntax', ('', 1, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 1, 78)) 2025-12-04T10:03:38.9199680Z 2025-12-04T10:03:38.9199859Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T10:03:38.9200186Z ^ 2025-12-04T10:03:38.9219914Z msg = Cannot scrape callname=synchronize in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=239. 2025-12-04T10:03:38.9220685Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.9221181Z Wait for all kernels in all streams on the given device to complete. 2025-12-04T10:03:38.9221457Z 2025-12-04T10:03:38.9221533Z Args: 2025-12-04T10:03:38.9221918Z device (:class:`torch.device`, str, int, optional): device for which to synchronize. It must match 2025-12-04T10:03:38.9222486Z the current :ref:`accelerator` device type. If not given, 2025-12-04T10:03:38.9222962Z use :func:`torch.accelerator.current_device_index` by default. 2025-12-04T10:03:38.9223379Z 2025-12-04T10:03:38.9223758Z .. note:: This function is a no-op if the current :ref:`accelerator` is not initialized. 2025-12-04T10:03:38.9224127Z 2025-12-04T10:03:38.9224214Z Example:: 2025-12-04T10:03:38.9224329Z 2025-12-04T10:03:38.9224449Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA) 2025-12-04T10:03:38.9224889Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 
2025-12-04T10:03:38.9225353Z >>> start_event = torch.Event(enable_timing=True) 2025-12-04T10:03:38.9225681Z >>> end_event = torch.Event(enable_timing=True) 2025-12-04T10:03:38.9225980Z >>> start_event.record() 2025-12-04T10:03:38.9226348Z >>> tensor = torch.randn(100, device=torch.accelerator.current_accelerator()) 2025-12-04T10:03:38.9226754Z >>> sum = torch.sum(tensor) 2025-12-04T10:03:38.9227008Z >>> end_event.record() 2025-12-04T10:03:38.9227370Z >>> torch.accelerator.synchronize() 2025-12-04T10:03:38.9227734Z >>> elapsed_time_ms = start_event.elapsed_time(end_event) 2025-12-04T10:03:38.9228035Z 2025-12-04T10:03:38.9228555Z Original Error: SyntaxError('invalid syntax', ('', 2, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 2, 78)) 2025-12-04T10:03:38.9229006Z 2025-12-04T10:03:38.9229190Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T10:03:38.9229509Z ^ 2025-12-04T10:03:38.9459439Z msg = Cannot scrape callname=cudart in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py line=448. 2025-12-04T10:03:38.9460170Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:38.9460579Z Retrieves the CUDA runtime API module. 2025-12-04T10:03:38.9460762Z 2025-12-04T10:03:38.9460766Z 2025-12-04T10:03:38.9460983Z This function initializes the CUDA runtime environment if it is not already 2025-12-04T10:03:38.9461488Z initialized and returns the CUDA runtime API module (_cudart). The CUDA 2025-12-04T10:03:38.9461965Z runtime API module provides access to various CUDA runtime functions. 2025-12-04T10:03:38.9462249Z 2025-12-04T10:03:38.9462327Z Args: 2025-12-04T10:03:38.9462489Z ``None`` 2025-12-04T10:03:38.9462614Z 2025-12-04T10:03:38.9462684Z Returns: 2025-12-04T10:03:38.9463025Z module: The CUDA runtime API module (_cudart). 
2025-12-04T10:03:38.9463248Z 2025-12-04T10:03:38.9463325Z Raises: 2025-12-04T10:03:38.9463617Z RuntimeError: If CUDA cannot be re-initialized in a forked subprocess. 2025-12-04T10:03:38.9464206Z AssertionError: If PyTorch is not compiled with CUDA support or if libcudart functions are unavailable. 2025-12-04T10:03:38.9464603Z 2025-12-04T10:03:38.9464742Z Example of CUDA operations with profiling: 2025-12-04T10:03:38.9465024Z >>> import torch 2025-12-04T10:03:38.9465369Z >>> from torch.cuda import cudart, check_error 2025-12-04T10:03:38.9465652Z >>> import os 2025-12-04T10:03:38.9465848Z >>> 2025-12-04T10:03:38.9466042Z >>> os.environ["CUDA_PROFILE"] = "1" 2025-12-04T10:03:38.9466297Z >>> 2025-12-04T10:03:38.9466513Z >>> def perform_cuda_operations_with_streams(): 2025-12-04T10:03:38.9466812Z >>> stream = torch.cuda.Stream() 2025-12-04T10:03:38.9467097Z >>> with torch.cuda.stream(stream): 2025-12-04T10:03:38.9467487Z >>> x = torch.randn(100, 100, device='cuda') 2025-12-04T10:03:38.9467805Z >>> y = torch.randn(100, 100, device='cuda') 2025-12-04T10:03:38.9468080Z >>> z = torch.mul(x, y) 2025-12-04T10:03:38.9468335Z >>> return z 2025-12-04T10:03:38.9468498Z >>> 2025-12-04T10:03:38.9468651Z >>> torch.cuda.synchronize() 2025-12-04T10:03:38.9468885Z >>> print("====== Start nsys profiling ======") 2025-12-04T10:03:38.9469147Z >>> check_error(cudart().cudaProfilerStart()) 2025-12-04T10:03:38.9469558Z >>> with torch.autograd.profiler.emit_nvtx(): 2025-12-04T10:03:38.9469855Z >>> result = perform_cuda_operations_with_streams() 2025-12-04T10:03:38.9470127Z >>> print("CUDA operations completed.") 2025-12-04T10:03:38.9470392Z >>> check_error(torch.cuda.cudart().cudaProfilerStop()) 2025-12-04T10:03:38.9470664Z >>> print("====== End nsys profiling ======") 2025-12-04T10:03:38.9470828Z 2025-12-04T10:03:38.9470964Z To run this example and save the profiling information, execute: 2025-12-04T10:03:38.9471434Z >>> $ nvprof --profile-from-start off --csv --print-summary -o 
trace_name.prof -f -- python cudart_test.py 2025-12-04T10:03:38.9471753Z 2025-12-04T10:03:38.9471916Z This command profiles the CUDA operations in the provided script and saves 2025-12-04T10:03:38.9472292Z the profiling information to a file named `trace_name.prof`. 2025-12-04T10:03:38.9472669Z The `--profile-from-start off` option ensures that profiling starts only 2025-12-04T10:03:38.9473001Z after the `cudaProfilerStart` call in the script. 2025-12-04T10:03:38.9473315Z The `--csv` and `--print-summary` options format the profiling output as a 2025-12-04T10:03:38.9473634Z CSV file and print a summary, respectively. 2025-12-04T10:03:38.9473963Z The `-o` option specifies the output file name, and the `-f` option forces the 2025-12-04T10:03:38.9474302Z overwrite of the output file if it already exists. 2025-12-04T10:03:38.9474533Z 2025-12-04T10:03:38.9475055Z Original Error: SyntaxError('invalid syntax', ('', 1, 1, '$ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py\n', 1, 2)) 2025-12-04T10:03:38.9475569Z 2025-12-04T10:03:38.9475809Z $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T10:03:38.9476172Z ^ 2025-12-04T10:03:40.1354291Z msg = Cannot scrape callname=vmap in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=39. 2025-12-04T10:03:40.1355074Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:40.1355565Z 2025-12-04T10:03:40.1355766Z vmap is the vectorizing map; ``vmap(func)`` returns a new function that 2025-12-04T10:03:40.1356209Z maps ``func`` over some dimension of the inputs. Semantically, vmap 2025-12-04T10:03:40.1356950Z pushes the map into PyTorch operations called by ``func``, effectively 2025-12-04T10:03:40.1357363Z vectorizing those operations. 
2025-12-04T10:03:40.1357532Z 2025-12-04T10:03:40.1357719Z vmap is useful for handling batch dimensions: one can write a function 2025-12-04T10:03:40.1358150Z ``func`` that runs on examples and then lift it to a function that can 2025-12-04T10:03:40.1358535Z take batches of examples with ``vmap(func)``. vmap can also be used to 2025-12-04T10:03:40.1358872Z compute batched gradients when composed with autograd. 2025-12-04T10:03:40.1359165Z 2025-12-04T10:03:40.1359245Z .. note:: 2025-12-04T10:03:40.1359478Z :func:`torch.vmap` is aliased to :func:`torch.func.vmap` for 2025-12-04T10:03:40.1359773Z convenience. Use whichever one you'd like. 2025-12-04T10:03:40.1359941Z 2025-12-04T10:03:40.1360004Z Args: 2025-12-04T10:03:40.1360230Z func (function): A Python function that takes one or more arguments. 2025-12-04T10:03:40.1360540Z Must return one or more Tensors. 2025-12-04T10:03:40.1360830Z in_dims (int or nested structure): Specifies which dimension of the 2025-12-04T10:03:40.1361160Z inputs should be mapped over. ``in_dims`` should have a 2025-12-04T10:03:40.1361494Z structure like the inputs. If the ``in_dim`` for a particular 2025-12-04T10:03:40.1361828Z input is None, then that indicates there is no map dimension. 2025-12-04T10:03:40.1362080Z Default: 0. 2025-12-04T10:03:40.1362314Z out_dims (int or Tuple[int]): Specifies where the mapped dimension 2025-12-04T10:03:40.1362734Z should appear in the outputs. If ``out_dims`` is a Tuple, then 2025-12-04T10:03:40.1363112Z it should have one element per output. Default: 0. 2025-12-04T10:03:40.1363418Z randomness (str): Specifies whether the randomness in this 2025-12-04T10:03:40.1363754Z vmap should be the same or different across batches. If 'different', 2025-12-04T10:03:40.1364121Z the randomness for each batch will be different. If 'same', the 2025-12-04T10:03:40.1364469Z randomness will be the same across batches. If 'error', any calls to 2025-12-04T10:03:40.1364826Z random functions will error. Default: 'error'. 
WARNING: this flag 2025-12-04T10:03:40.1365176Z only applies to random PyTorch operations and does not apply to 2025-12-04T10:03:40.1365477Z Python's random module or numpy randomness. 2025-12-04T10:03:40.1365791Z chunk_size (None or int): If None (default), apply a single vmap over inputs. 2025-12-04T10:03:40.1366167Z If not None, then compute the vmap :attr:`chunk_size` samples at a time. 2025-12-04T10:03:40.1366567Z Note that :attr:`chunk_size=1` is equivalent to computing the vmap with a for-loop. 2025-12-04T10:03:40.1366989Z If you run into memory issues computing the vmap, please try a non-None chunk_size. 2025-12-04T10:03:40.1367247Z 2025-12-04T10:03:40.1367304Z Returns: 2025-12-04T10:03:40.1367519Z Returns a new "batched" function. It takes the same inputs as 2025-12-04T10:03:40.1367851Z ``func``, except each input has an extra dimension at the index 2025-12-04T10:03:40.1368170Z specified by ``in_dims``. It takes returns the same outputs as 2025-12-04T10:03:40.1368509Z ``func``, except each output has an extra dimension at the index 2025-12-04T10:03:40.1368790Z specified by ``out_dims``. 2025-12-04T10:03:40.1368916Z 2025-12-04T10:03:40.1368973Z .. warning: 2025-12-04T10:03:40.1369194Z :func:`vmap` works best with functional-style code. Please do not 2025-12-04T10:03:40.1369531Z perform any side-effects in ``func``, with the exception of 2025-12-04T10:03:40.1369897Z in-place PyTorch operations. Examples of side-effects include mutating 2025-12-04T10:03:40.1370278Z Python data structures and assigning values to variables not captured 2025-12-04T10:03:40.1370563Z in ``func``. 2025-12-04T10:03:40.1370657Z 2025-12-04T10:03:40.1370817Z One example of using :func:`vmap` is to compute batched dot products. PyTorch 2025-12-04T10:03:40.1371295Z doesn't provide a batched ``torch.dot`` API; instead of unsuccessfully 2025-12-04T10:03:40.1371669Z rummaging through docs, use :func:`vmap` to construct a new function. 
2025-12-04T10:03:40.1371894Z 2025-12-04T10:03:40.1371994Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:40.1372277Z >>> batched_dot = torch.func.vmap(torch.dot) # [N, D], [N, D] -> [N] 2025-12-04T10:03:40.1372571Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T10:03:40.1372798Z >>> batched_dot(x, y) 2025-12-04T10:03:40.1372967Z 2025-12-04T10:03:40.1373133Z :func:`vmap` can be helpful in hiding batch dimensions, leading to a simpler 2025-12-04T10:03:40.1373433Z model authoring experience. 2025-12-04T10:03:40.1373562Z 2025-12-04T10:03:40.1373642Z >>> batch_size, feature_size = 3, 5 2025-12-04T10:03:40.1373907Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T10:03:40.1374153Z >>> 2025-12-04T10:03:40.1374296Z >>> def model(feature_vec): 2025-12-04T10:03:40.1374533Z >>> # Very simple linear model with activation 2025-12-04T10:03:40.1374782Z >>> return feature_vec.dot(weights).relu() 2025-12-04T10:03:40.1374993Z >>> 2025-12-04T10:03:40.1375199Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T10:03:40.1375460Z >>> result = torch.vmap(model)(examples) 2025-12-04T10:03:40.1375611Z 2025-12-04T10:03:40.1375776Z :func:`vmap` can also help vectorize computations that were previously difficult 2025-12-04T10:03:40.1376188Z or impossible to batch. One example is higher-order gradient computation. 2025-12-04T10:03:40.1376665Z The PyTorch autograd engine computes vjps (vector-Jacobian products). 2025-12-04T10:03:40.1377042Z Computing a full Jacobian matrix for some function f: R^N -> R^N usually 2025-12-04T10:03:40.1377462Z requires N calls to ``autograd.grad``, one per Jacobian row. Using :func:`vmap`, 2025-12-04T10:03:40.1377868Z we can vectorize the whole computation, computing the Jacobian in a single 2025-12-04T10:03:40.1378187Z call to ``autograd.grad``. 
2025-12-04T10:03:40.1378308Z 2025-12-04T10:03:40.1378366Z >>> # Setup 2025-12-04T10:03:40.1378518Z >>> N = 5 2025-12-04T10:03:40.1378673Z >>> f = lambda x: x**2 2025-12-04T10:03:40.1378863Z >>> x = torch.randn(N, requires_grad=True) 2025-12-04T10:03:40.1379075Z >>> y = f(x) 2025-12-04T10:03:40.1379242Z >>> I_N = torch.eye(N) 2025-12-04T10:03:40.1379401Z >>> 2025-12-04T10:03:40.1379546Z >>> # Sequential approach 2025-12-04T10:03:40.1379822Z >>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0] 2025-12-04T10:03:40.1380117Z >>> for v in I_N.unbind()] 2025-12-04T10:03:40.1380349Z >>> jacobian = torch.stack(jacobian_rows) 2025-12-04T10:03:40.1380555Z >>> 2025-12-04T10:03:40.1380712Z >>> # vectorized gradient computation 2025-12-04T10:03:40.1380918Z >>> def get_vjp(v): 2025-12-04T10:03:40.1381106Z >>> return torch.autograd.grad(y, x, v) 2025-12-04T10:03:40.1381343Z >>> jacobian = torch.vmap(get_vjp)(I_N) 2025-12-04T10:03:40.1381491Z 2025-12-04T10:03:40.1381669Z :func:`vmap` can also be nested, producing an output with multiple batched dimensions 2025-12-04T10:03:40.1381929Z 2025-12-04T10:03:40.1382013Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:40.1382224Z >>> batched_dot = torch.vmap( 2025-12-04T10:03:40.1382420Z ... torch.vmap(torch.dot) 2025-12-04T10:03:40.1382629Z ... 
) # [N1, N0, D], [N1, N0, D] -> [N1, N0] 2025-12-04T10:03:40.1382886Z >>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5) 2025-12-04T10:03:40.1383144Z >>> batched_dot(x, y) # tensor of size [2, 3] 2025-12-04T10:03:40.1383301Z 2025-12-04T10:03:40.1383459Z If the inputs are not batched along the first dimension, ``in_dims`` specifies 2025-12-04T10:03:40.1383799Z the dimension that each inputs are batched along as 2025-12-04T10:03:40.1383970Z 2025-12-04T10:03:40.1384048Z >>> torch.dot # [N], [N] -> [] 2025-12-04T10:03:40.1384368Z >>> batched_dot = torch.vmap(torch.dot, in_dims=1) # [N, D], [N, D] -> [D] 2025-12-04T10:03:40.1384693Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T10:03:40.1384911Z >>> batched_dot( 2025-12-04T10:03:40.1385075Z ... x, y 2025-12-04T10:03:40.1385304Z ... ) # output is [5] instead of [2] if batched along the 0th dimension 2025-12-04T10:03:40.1385516Z 2025-12-04T10:03:40.1385689Z If there are multiple inputs each of which is batched along different dimensions, 2025-12-04T10:03:40.1386076Z ``in_dims`` must be a tuple with the batch dimension for each input as 2025-12-04T10:03:40.1386332Z 2025-12-04T10:03:40.1386400Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:40.1386692Z >>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None)) # [N, D], [D] -> [N] 2025-12-04T10:03:40.1387007Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T10:03:40.1387340Z >>> batched_dot( 2025-12-04T10:03:40.1387498Z ... x, y 2025-12-04T10:03:40.1387726Z ... 
) # second arg doesn't have a batch dim because in_dim[1] was None 2025-12-04T10:03:40.1387932Z 2025-12-04T10:03:40.1388090Z If the input is a Python struct, ``in_dims`` must be a tuple containing a struct 2025-12-04T10:03:40.1388399Z matching the shape of the input: 2025-12-04T10:03:40.1388552Z 2025-12-04T10:03:40.1388650Z >>> f = lambda dict: torch.dot(dict["x"], dict["y"]) 2025-12-04T10:03:40.1388900Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T10:03:40.1389119Z >>> input = {"x": x, "y": y} 2025-12-04T10:03:40.1389356Z >>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},)) 2025-12-04T10:03:40.1389688Z >>> batched_dot(input) 2025-12-04T10:03:40.1389807Z 2025-12-04T10:03:40.1390046Z By default, the output is batched along the first dimension. However, it can be batched 2025-12-04T10:03:40.1390394Z along any dimension by using ``out_dims`` 2025-12-04T10:03:40.1390554Z 2025-12-04T10:03:40.1390620Z >>> f = lambda x: x**2 2025-12-04T10:03:40.1390813Z >>> x = torch.randn(2, 5) 2025-12-04T10:03:40.1391028Z >>> batched_pow = torch.vmap(f, out_dims=1) 2025-12-04T10:03:40.1391255Z >>> batched_pow(x) # [5, 2] 2025-12-04T10:03:40.1391388Z 2025-12-04T10:03:40.1391585Z For any function that uses kwargs, the returned function will not batch the kwargs but will 2025-12-04T10:03:40.1391920Z accept kwargs 2025-12-04T10:03:40.1392011Z 2025-12-04T10:03:40.1392078Z >>> x = torch.randn([2, 5]) 2025-12-04T10:03:40.1392270Z >>> def fn(x, scale=4.): 2025-12-04T10:03:40.1392464Z >>> return x * scale 2025-12-04T10:03:40.1392629Z >>> 2025-12-04T10:03:40.1392785Z >>> batched_pow = torch.vmap(fn) 2025-12-04T10:03:40.1393023Z >>> assert torch.allclose(batched_pow(x), x * 4) 2025-12-04T10:03:40.1393344Z >>> batched_pow(x, scale=x) # scale is not batched, output has shape [2, 2, 5] 2025-12-04T10:03:40.1393573Z 2025-12-04T10:03:40.1393632Z .. 
note:: 2025-12-04T10:03:40.1393865Z vmap does not provide general autobatching or handle variable-length 2025-12-04T10:03:40.1394159Z sequences out of the box. 2025-12-04T10:03:40.1394284Z 2025-12-04T10:03:40.1394637Z Original Error: IndentationError('expected an indented block after function definition on line 4', ('', 5, 1, '_._ = None\n', 5, 2)) 2025-12-04T10:03:40.1395063Z 2025-12-04T10:03:40.1395119Z _._ = None 2025-12-04T10:03:40.1395256Z ^ 2025-12-04T10:03:40.1395674Z msg = Cannot scrape callname=grad in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=306. 2025-12-04T10:03:40.1396229Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:40.1396631Z ``grad`` operator helps computing gradients of ``func`` with respect to the 2025-12-04T10:03:40.1396999Z input(s) specified by ``argnums``. This operator can be nested to 2025-12-04T10:03:40.1397284Z compute higher-order gradients. 2025-12-04T10:03:40.1397427Z 2025-12-04T10:03:40.1397494Z Args: 2025-12-04T10:03:40.1397724Z func (Callable): A Python function that takes one or more arguments. 2025-12-04T10:03:40.1398171Z Must return a single-element Tensor. If specified ``has_aux`` equals ``True``, 2025-12-04T10:03:40.1398614Z function can return a tuple of single-element Tensor and other auxiliary objects: 2025-12-04T10:03:40.1398940Z ``(output, aux)``. 2025-12-04T10:03:40.1399257Z argnums (int or Tuple[int]): Specifies arguments to compute gradients with respect to. 2025-12-04T10:03:40.1399662Z ``argnums`` can be single integer or tuple of integers. Default: 0. 2025-12-04T10:03:40.1400066Z has_aux (bool): Flag indicating that ``func`` returns a tensor and other 2025-12-04T10:03:40.1400408Z auxiliary objects: ``(output, aux)``. Default: False. 2025-12-04T10:03:40.1400600Z 2025-12-04T10:03:40.1400659Z Returns: 2025-12-04T10:03:40.1400940Z Function to compute gradients with respect to its inputs. 
By default, the output of 2025-12-04T10:03:40.1401362Z the function is the gradient tensor(s) with respect to the first argument. 2025-12-04T10:03:40.1401780Z If specified ``has_aux`` equals ``True``, tuple of gradients and output auxiliary objects 2025-12-04T10:03:40.1402211Z is returned. If ``argnums`` is a tuple of integers, a tuple of output gradients with 2025-12-04T10:03:40.1402562Z respect to each ``argnums`` value is returned. 2025-12-04T10:03:40.1402734Z 2025-12-04T10:03:40.1402812Z Example of using ``grad``: 2025-12-04T10:03:40.1402948Z 2025-12-04T10:03:40.1403017Z >>> # xdoctest: +SKIP 2025-12-04T10:03:40.1403222Z >>> from torch.func import grad 2025-12-04T10:03:40.1403513Z >>> x = torch.randn([]) 2025-12-04T10:03:40.1403727Z >>> cos_x = grad(lambda x: torch.sin(x))(x) 2025-12-04T10:03:40.1403988Z >>> assert torch.allclose(cos_x, x.cos()) 2025-12-04T10:03:40.1404197Z >>> 2025-12-04T10:03:40.1404357Z >>> # Second-order gradients 2025-12-04T10:03:40.1404605Z >>> neg_sin_x = grad(grad(lambda x: torch.sin(x)))(x) 2025-12-04T10:03:40.1404882Z >>> assert torch.allclose(neg_sin_x, -x.sin()) 2025-12-04T10:03:40.1405045Z 2025-12-04T10:03:40.1405215Z When composed with ``vmap``, ``grad`` can be used to compute per-sample-gradients: 2025-12-04T10:03:40.1405466Z 2025-12-04T10:03:40.1405532Z >>> # xdoctest: +SKIP 2025-12-04T10:03:40.1405737Z >>> from torch.func import grad, vmap 2025-12-04T10:03:40.1405964Z >>> batch_size, feature_size = 3, 5 2025-12-04T10:03:40.1406169Z >>> 2025-12-04T10:03:40.1406328Z >>> def model(weights, feature_vec): 2025-12-04T10:03:40.1406562Z >>> # Very simple linear model with activation 2025-12-04T10:03:40.1406802Z >>> assert feature_vec.dim() == 1 2025-12-04T10:03:40.1407030Z >>> return feature_vec.dot(weights).relu() 2025-12-04T10:03:40.1407239Z >>> 2025-12-04T10:03:40.1407409Z >>> def compute_loss(weights, example, target): 2025-12-04T10:03:40.1407646Z >>> y = model(weights, example) 2025-12-04T10:03:40.1407894Z >>> return ((y - target) 
** 2).mean() # MSELoss 2025-12-04T10:03:40.1408112Z >>> 2025-12-04T10:03:40.1408317Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T10:03:40.1408618Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T10:03:40.1408869Z >>> targets = torch.randn(batch_size) 2025-12-04T10:03:40.1409108Z >>> inputs = (weights, examples, targets) 2025-12-04T10:03:40.1409423Z >>> grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))( 2025-12-04T10:03:40.1409727Z ... *inputs 2025-12-04T10:03:40.1409888Z ... ) 2025-12-04T10:03:40.1409981Z 2025-12-04T10:03:40.1410101Z Example of using ``grad`` with ``has_aux`` and ``argnums``: 2025-12-04T10:03:40.1410295Z 2025-12-04T10:03:40.1410367Z >>> # xdoctest: +SKIP 2025-12-04T10:03:40.1410564Z >>> from torch.func import grad 2025-12-04T10:03:40.1410835Z >>> def my_loss_func(y, y_pred): 2025-12-04T10:03:40.1411079Z >>> loss_per_sample = (0.5 * y_pred - y) ** 2 2025-12-04T10:03:40.1411319Z >>> loss = loss_per_sample.mean() 2025-12-04T10:03:40.1411566Z >>> return loss, (y_pred, loss_per_sample) 2025-12-04T10:03:40.1411787Z >>> 2025-12-04T10:03:40.1411977Z >>> fn = grad(my_loss_func, argnums=(0, 1), has_aux=True) 2025-12-04T10:03:40.1412220Z >>> y_true = torch.rand(4) 2025-12-04T10:03:40.1412445Z >>> y_preds = torch.rand(4, requires_grad=True) 2025-12-04T10:03:40.1412724Z >>> out = fn(y_true, y_preds) 2025-12-04T10:03:40.1413030Z >>> # > output is ((grads w.r.t y_true, grads w.r.t y_preds), (y_pred, loss_per_sample)) 2025-12-04T10:03:40.1413290Z 2025-12-04T10:03:40.1413350Z .. note:: 2025-12-04T10:03:40.1413560Z Using PyTorch ``torch.no_grad`` together with ``grad``. 
2025-12-04T10:03:40.1413748Z 2025-12-04T10:03:40.1413856Z Case 1: Using ``torch.no_grad`` inside a function: 2025-12-04T10:03:40.1414032Z 2025-12-04T10:03:40.1414102Z >>> # xdoctest: +SKIP 2025-12-04T10:03:40.1414299Z >>> def f(x): 2025-12-04T10:03:40.1414482Z >>> with torch.no_grad(): 2025-12-04T10:03:40.1414687Z >>> c = x ** 2 2025-12-04T10:03:40.1414893Z >>> return x - c 2025-12-04T10:03:40.1415020Z 2025-12-04T10:03:40.1415165Z In this case, ``grad(f)(x)`` will respect the inner ``torch.no_grad``. 2025-12-04T10:03:40.1415378Z 2025-12-04T10:03:40.1415514Z Case 2: Using ``grad`` inside ``torch.no_grad`` context manager: 2025-12-04T10:03:40.1415761Z 2025-12-04T10:03:40.1415904Z >>> # xdoctest: +SKIP 2025-12-04T10:03:40.1416116Z >>> with torch.no_grad(): 2025-12-04T10:03:40.1416319Z >>> grad(f)(x) 2025-12-04T10:03:40.1416435Z 2025-12-04T10:03:40.1416584Z In this case, ``grad`` will respect the inner ``torch.no_grad``, but not the 2025-12-04T10:03:40.1416962Z outer one. This is because ``grad`` is a "function transform": its result 2025-12-04T10:03:40.1417333Z should not depend on the result of a context manager outside of ``f``. 2025-12-04T10:03:40.1417555Z 2025-12-04T10:03:40.1417613Z 2025-12-04T10:03:40.1418038Z Original Error: IndentationError('expected an indented block after function definition on line 5', ('', 6, 1, '_._ = None\n', 6, 2)) 2025-12-04T10:03:40.1418464Z 2025-12-04T10:03:40.1418520Z _._ = None 2025-12-04T10:03:40.1418667Z ^ 2025-12-04T10:03:42.6389880Z msg = Cannot scrape callname=CustomOpDef.register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py line=402. 2025-12-04T10:03:42.6390787Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:42.6391259Z Register a FakeTensor implementation for this custom op. 2025-12-04T10:03:42.6391511Z 2025-12-04T10:03:42.6391726Z This is necessary to get the operator to work efficiently with torch.compile. 
2025-12-04T10:03:42.6392053Z 2025-12-04T10:03:42.6392244Z The Fake impl (sometimes also known as a meta kernel or abstract impl) 2025-12-04T10:03:42.6392719Z specifies the behavior of this operator on Tensors that carry no data. 2025-12-04T10:03:42.6393136Z Given some input Tensors with certain properties 2025-12-04T10:03:42.6393586Z (sizes/strides/storage_offset/device), it specifies what the properties of 2025-12-04T10:03:42.6393988Z the output Tensors are. 2025-12-04T10:03:42.6394150Z 2025-12-04T10:03:42.6394341Z Please see :func:`torch.library.register_fake` for more details. 2025-12-04T10:03:42.6394618Z 2025-12-04T10:03:42.6394690Z Args: 2025-12-04T10:03:42.6394955Z fn (Callable): The function to register as the FakeTensor 2025-12-04T10:03:42.6395282Z implementation. 2025-12-04T10:03:42.6395435Z 2025-12-04T10:03:42.6395509Z Examples: 2025-12-04T10:03:42.6395711Z >>> import torch 2025-12-04T10:03:42.6396234Z >>> import numpy as np 2025-12-04T10:03:42.6396520Z >>> from torch import Tensor 2025-12-04T10:03:42.6396768Z >>> 2025-12-04T10:03:42.6397049Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T10:03:42.6397484Z >>> @torch.library.custom_op("mylib::linear", mutates_args=()) 2025-12-04T10:03:42.6397895Z >>> def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T10:03:42.6406623Z >>> return (x @ weight.t()) + bias 2025-12-04T10:03:42.6407054Z >>> 2025-12-04T10:03:42.6407250Z >>> @linear.register_fake 2025-12-04T10:03:42.6407473Z >>> def _(x, weight, bias): 2025-12-04T10:03:42.6407682Z >>> assert x.dim() == 2 2025-12-04T10:03:42.6407896Z >>> assert weight.dim() == 2 2025-12-04T10:03:42.6408121Z >>> assert bias.dim() == 1 2025-12-04T10:03:42.6408360Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T10:03:42.6408627Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T10:03:42.6408888Z >>> assert x.device == weight.device 2025-12-04T10:03:42.6409158Z >>> return x.new_empty(x.size(0), weight.size(0)) 
2025-12-04T10:03:42.6409395Z >>> 2025-12-04T10:03:42.6409561Z >>> x = torch.randn(2, 2) 2025-12-04T10:03:42.6409779Z >>> weight = torch.randn(2, 2) 2025-12-04T10:03:42.6410011Z >>> bias = torch.randn(2) 2025-12-04T10:03:42.6410249Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:42.6410717Z >>> out = torch.compile(linear, fullgraph=True)(x, weight, bias) 2025-12-04T10:03:42.6411025Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:42.6411361Z >>> assert torch.allclose(out, torch.nn.functional.linear(x, weight, bias)) 2025-12-04T10:03:42.6411668Z >>> 2025-12-04T10:03:42.6411882Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T10:03:42.6412225Z >>> @torch.library.custom_op("mylib::nonzero", mutates_args=()) 2025-12-04T10:03:42.6412543Z >>> def nonzero(x: Tensor) -> Tensor: 2025-12-04T10:03:42.6412789Z >>> x_np = x.cpu().numpy() 2025-12-04T10:03:42.6413016Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T10:03:42.6413276Z >>> return torch.tensor(res, device=x.device) 2025-12-04T10:03:42.6413496Z >>> 2025-12-04T10:03:42.6413663Z >>> @nonzero.register_fake 2025-12-04T10:03:42.6413873Z >>> def _(x): 2025-12-04T10:03:42.6414091Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T10:03:42.6414385Z >>> # Since we cannot peek at the data in an abstract impl, 2025-12-04T10:03:42.6414683Z >>> # we use the ctx object to construct a new symint that 2025-12-04T10:03:42.6414958Z >>> # represents the data-dependent size. 
2025-12-04T10:03:42.6415206Z >>> ctx = torch.library.get_ctx() 2025-12-04T10:03:42.6415432Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T10:03:42.6415661Z >>> shape = [nnz, x.dim()] 2025-12-04T10:03:42.6415916Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T10:03:42.6416157Z >>> return result 2025-12-04T10:03:42.6416361Z >>> 2025-12-04T10:03:42.6416538Z >>> x = torch.tensor([0, 1, 2, 0, 0, 1]) 2025-12-04T10:03:42.6416791Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:42.6417061Z >>> out = torch.compile(nonzero, fullgraph=True)(x) 2025-12-04T10:03:42.6417342Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:42.6417595Z >>> assert torch.allclose(out, x.nonzero()) 2025-12-04T10:03:42.6417754Z 2025-12-04T10:03:42.6417812Z 2025-12-04T10:03:42.6418746Z Original Error: IndentationError('expected an indented block after function definition on line 36', ('', 37, 1, '_._ = None\n', 37, 2)) 2025-12-04T10:03:42.6419195Z 2025-12-04T10:03:42.6419258Z _._ = None 2025-12-04T10:03:42.6419406Z ^ 2025-12-04T10:03:42.6558930Z msg = Cannot scrape callname=unsafe_generate_fake_kernels in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_profile.py line=94. 2025-12-04T10:03:42.6559790Z Caused by: DoctestParseError('Failed to parse doctest in _label_docsrc_lines') 2025-12-04T10:03:42.6560276Z 2025-12-04T10:03:42.6560489Z Registers a fake kernel based on the given operator profiles. This fake 2025-12-04T10:03:42.6560997Z kernel registration will override any existing fake kernel registrations. 2025-12-04T10:03:42.6561299Z 2025-12-04T10:03:42.6561481Z The input is a dictionary mapping operator names to a set of operator 2025-12-04T10:03:42.6561952Z profiles, which we will use to generate fake kernels. The operator profiles 2025-12-04T10:03:42.6562409Z are a record of the input and output tensor metadata. 
Based on this 2025-12-04T10:03:42.6562863Z information we will match a given input to the recorded profile, and return 2025-12-04T10:03:42.6563341Z an output with the same metadata as in the recorded profile. If a profile 2025-12-04T10:03:42.6563749Z doesn't exist then an exception will be thrown. 2025-12-04T10:03:42.6563968Z 2025-12-04T10:03:42.6564172Z The fake kernel generation is considered unsafe because it relies on the 2025-12-04T10:03:42.6564643Z rigid, pre-defined operator profiles that do not account for potential 2025-12-04T10:03:42.6565346Z variations in output behavior. Specifically, the generated kernels assume a 2025-12-04T10:03:42.6565880Z fixed relationship between input and output ranks. However, in reality, it's 2025-12-04T10:03:42.6566405Z possible that data-dependent operations may produce outputs of different 2025-12-04T10:03:42.6566885Z ranks even when given inputs of the same rank. The generated fake kernels 2025-12-04T10:03:42.6567349Z are inflexible and unable to accommodate these nuances, making them 2025-12-04T10:03:42.6567702Z potentially unsafe. 2025-12-04T10:03:42.6567834Z 2025-12-04T10:03:42.6567903Z Args: 2025-12-04T10:03:42.6568188Z op_profiles (dict[str, set[OpProfile]]): A dictionary mapping operator 2025-12-04T10:03:42.6568645Z name to a set of operator profiles from which we will generate fake 2025-12-04T10:03:42.6568979Z kernels. 
2025-12-04T10:03:42.6569096Z 2025-12-04T10:03:42.6569167Z Examples: 2025-12-04T10:03:42.6569276Z 2025-12-04T10:03:42.6569430Z >>> # Example: Registering an op-profile from draft-export 2025-12-04T10:03:42.6569697Z >>> import torch 2025-12-04T10:03:42.6569912Z >>> from torch.export._draft_export import draft_export 2025-12-04T10:03:42.6570153Z >>> 2025-12-04T10:03:42.6570365Z >>> @torch.library.custom_op("mylib::foo", mutates_args=()) 2025-12-04T10:03:42.6570643Z >>> def foo(x: Tensor, y: Tensor) -> Tensor: 2025-12-04T10:03:42.6570867Z >>> return x + y 2025-12-04T10:03:42.6571047Z >>> 2025-12-04T10:03:42.6571198Z >>> class M(torch.nn.Module): 2025-12-04T10:03:42.6571407Z >>> def forward(self, a, b): 2025-12-04T10:03:42.6571649Z >>> res = torch.ops.mylib.foo(a, b) # no fake impl 2025-12-04T10:03:42.6571884Z >>> return res 2025-12-04T10:03:42.6572048Z >>> 2025-12-04T10:03:42.6572245Z >>> ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)) 2025-12-04T10:03:42.6572492Z >>> 2025-12-04T10:03:42.6572762Z >>> with torch._library.fake_profile.unsafe_generate_fake_kernels(ep._report.op_profiles): 2025-12-04T10:03:42.6573133Z >>> decomp = ep.run_decompositions() 2025-12-04T10:03:42.6573283Z 2025-12-04T10:03:42.6573286Z 2025-12-04T10:03:42.6573614Z Original Error: IncompleteParseError('ill-formed doctest: all parts have been processed but the doctest source is not balanced') 2025-12-04T10:03:42.6574007Z 2025-12-04T10:03:43.0978109Z msg = Cannot scrape callname=ActivationSparsifier in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/activation_sparsifier/activation_sparsifier.py line=16. 2025-12-04T10:03:43.0979164Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:43.0979475Z 2025-12-04T10:03:43.0979707Z The Activation sparsifier class aims to sparsify/prune activations in a neural 2025-12-04T10:03:43.0980209Z network. 
The idea is to attach the sparsifier to a layer (or layers) and it 2025-12-04T10:03:43.0980693Z zeroes out the activations based on the mask_fn (or sparsification function) 2025-12-04T10:03:43.0981184Z input by the user. 2025-12-04T10:03:43.0981500Z The mask_fn is applied once all the inputs are aggregated and reduced i.e. 2025-12-04T10:03:43.0981909Z mask = mask_fn(reduce_fn(aggregate_fn(activations))) 2025-12-04T10:03:43.0982133Z 2025-12-04T10:03:43.0982231Z Note:: 2025-12-04T10:03:43.0982600Z The sparsification mask is computed on the input **before it goes through the attached layer**. 2025-12-04T10:03:43.0982966Z 2025-12-04T10:03:43.0983035Z Args: 2025-12-04T10:03:43.0983228Z model (nn.Module): 2025-12-04T10:03:43.0983549Z The model whose layers will be sparsified. The layers that needs to be 2025-12-04T10:03:43.0984026Z sparsified should be added separately using the register_layer() function 2025-12-04T10:03:43.0984423Z aggregate_fn (Optional, Callable): 2025-12-04T10:03:43.0984822Z default aggregate_fn that is used if not specified while registering the layer. 2025-12-04T10:03:43.0985274Z specifies how inputs should be aggregated over time. 2025-12-04T10:03:43.0985913Z The aggregate_fn should usually take 2 torch tensors and return the aggregated tensor. 2025-12-04T10:03:43.0986323Z Example 2025-12-04T10:03:43.0986591Z def add_agg_fn(tensor1, tensor2): return tensor1 + tensor2 2025-12-04T10:03:43.0986921Z reduce_fn (Optional, Callable): 2025-12-04T10:03:43.0987436Z default reduce_fn that is used if not specified while registering the layer. 2025-12-04T10:03:43.0987965Z reduce_fn will be called on the aggregated tensor i.e. the tensor obtained after 2025-12-04T10:03:43.0988372Z calling agg_fn() on all inputs. 
2025-12-04T10:03:43.0988636Z Example 2025-12-04T10:03:43.0988945Z def mean_reduce_fn(agg_tensor): return agg_tensor.mean(dim=0) 2025-12-04T10:03:43.0989305Z mask_fn (Optional, Callable): 2025-12-04T10:03:43.0989705Z default mask_fn that is used to create the sparsification mask using the tensor obtained after 2025-12-04T10:03:43.0990179Z calling the reduce_fn(). This is used by default if a custom one is passed in the 2025-12-04T10:03:43.0990513Z register_layer(). 2025-12-04T10:03:43.0990889Z Note that the mask_fn() definition should contain the sparse arguments that is passed in sparse_config 2025-12-04T10:03:43.0991267Z arguments. 2025-12-04T10:03:43.0991462Z features (Optional, list): 2025-12-04T10:03:43.0991695Z default selected features to sparsify. 2025-12-04T10:03:43.0992032Z If this is non-empty, then the mask_fn will be applied for each feature of the input. 2025-12-04T10:03:43.0992353Z For example, 2025-12-04T10:03:43.0992647Z mask = [mask_fn(reduce_fn(aggregated_fn(input[feature])) for feature in features] 2025-12-04T10:03:43.0992966Z feature_dim (Optional, int): 2025-12-04T10:03:43.0993301Z default dimension of input features. Again, features along this dim will be chosen 2025-12-04T10:03:43.0993638Z for sparsification. 2025-12-04T10:03:43.0993851Z sparse_config (Dict): 2025-12-04T10:03:43.0994144Z Default configuration for the mask_fn. This config will be passed 2025-12-04T10:03:43.0994442Z with the mask_fn() 2025-12-04T10:03:43.0994578Z 2025-12-04T10:03:43.0994637Z Example: 2025-12-04T10:03:43.0994856Z >>> # xdoctest: +SKIP 2025-12-04T10:03:43.0995040Z >>> model = SomeModel() 2025-12-04T10:03:43.0995329Z >>> act_sparsifier = ActivationSparsifier(...) 
# init activation sparsifier 2025-12-04T10:03:43.0995653Z >>> # Initialize aggregate_fn 2025-12-04T10:03:43.0995850Z >>> def agg_fn(x, y): 2025-12-04T10:03:43.0996027Z >>> return x + y 2025-12-04T10:03:43.0996192Z >>> 2025-12-04T10:03:43.0996336Z >>> # Initialize reduce_fn 2025-12-04T10:03:43.0996541Z >>> def reduce_fn(x): 2025-12-04T10:03:43.0996780Z >>> return torch.mean(x, dim=0) 2025-12-04T10:03:43.0996977Z >>> 2025-12-04T10:03:43.0997124Z >>> # Initialize mask_fn 2025-12-04T10:03:43.0997308Z >>> def mask_fn(data): 2025-12-04T10:03:43.0997511Z >>> return torch.eye(data.shape).to(data.device) 2025-12-04T10:03:43.0997734Z >>> 2025-12-04T10:03:43.0997860Z >>> 2025-12-04T10:03:43.0998017Z >>> act_sparsifier.register_layer( 2025-12-04T10:03:43.0998228Z ... model.some_layer, 2025-12-04T10:03:43.0998418Z ... aggregate_fn=agg_fn, 2025-12-04T10:03:43.0998607Z ... reduce_fn=reduce_fn, 2025-12-04T10:03:43.0998788Z ... mask_fn=mask_fn, 2025-12-04T10:03:43.0998964Z ... ) 2025-12-04T10:03:43.0999098Z >>> 2025-12-04T10:03:43.0999237Z >>> # start training process 2025-12-04T10:03:43.0999421Z >>> for _ in [...]: 2025-12-04T10:03:43.0999589Z >>> # epoch starts 2025-12-04T10:03:43.0999811Z >>> # model.forward(), compute_loss() and model.backwards() 2025-12-04T10:03:43.1000125Z >>> # epoch ends 2025-12-04T10:03:43.1000297Z >>> act_sparsifier.step() 2025-12-04T10:03:43.1000529Z >>> # end training process 2025-12-04T10:03:43.1000725Z >>> sparsifier.squash_mask() 2025-12-04T10:03:43.1000851Z 2025-12-04T10:03:43.1001195Z Original Error: IndentationError("expected an indented block after 'for' statement on line 25", ('', 26, 1, '_._ = None\n', 26, 2)) 2025-12-04T10:03:43.1001605Z 2025-12-04T10:03:43.1001678Z _._ = None 2025-12-04T10:03:43.1001813Z ^ 2025-12-04T10:03:43.7414608Z msg = Cannot scrape callname=DeviceMesh.__getitem__ in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py line=547. 
2025-12-04T10:03:43.7415473Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:43.7415794Z 2025-12-04T10:03:43.7416033Z Slice the current DeviceMesh based on the mesh_dim_names given to create a submesh. 2025-12-04T10:03:43.7416583Z The submesh created consists of the dimensions and the communicators indicated by 2025-12-04T10:03:43.7417003Z ``mesh_dim_names`` 2025-12-04T10:03:43.7417139Z 2025-12-04T10:03:43.7417213Z Args: 2025-12-04T10:03:43.7417509Z mesh_dim_names (Union[str, Tuple[str]]): the name or the tuple of names of the 2025-12-04T10:03:43.7417966Z mesh dimension of the DeviceMesh to create the submesh for. 2025-12-04T10:03:43.7418283Z Returns: 2025-12-04T10:03:43.7418479Z A :class:`DeviceMesh` object 2025-12-04T10:03:43.7418649Z 2025-12-04T10:03:43.7418897Z The following program runs on each process/rank in an SPMD manner in a world size of 8. 2025-12-04T10:03:43.7419321Z In the first example: 2025-12-04T10:03:43.7419672Z Calling mesh_2d["tp"] on rank 0, 1, 2, 3 returns a 1D submesh of DeviceMesh:([0, 1, 2, 3]). 2025-12-04T10:03:43.7420076Z Calling mesh_2d["tp"] on rank 4, 5, 6, 7 returns a 1D submesh of DeviceMesh:([4, 5, 6, 7]). 2025-12-04T10:03:43.7420488Z Calling mesh_2d["dp"] on rank 0, 4 returns a 1D submesh of DeviceMesh:([0, 4]). 2025-12-04T10:03:43.7420874Z Calling mesh_2d["dp"] on rank 1, 5 returns a 1D submesh of DeviceMesh:([1, 5]). 2025-12-04T10:03:43.7421238Z Calling mesh_2d["dp"] on rank 2, 6 returns a 1D submesh of DeviceMesh:([2, 6]). 2025-12-04T10:03:43.7421603Z Calling mesh_2d["dp"] on rank 3, 7 returns a 1D submesh of DeviceMesh:([3, 7]). 2025-12-04T10:03:43.7421824Z 2025-12-04T10:03:43.7421898Z In the second example: 2025-12-04T10:03:43.7422456Z Calling mesh_3d["dp", "cp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 1], [4, 5]]). 2025-12-04T10:03:43.7422916Z Calling mesh_3d["dp", "cp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 3], [6, 7]]). 
2025-12-04T10:03:43.7423337Z Calling mesh_3d["cp", "dp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 4], [1, 5]]). 2025-12-04T10:03:43.7423763Z Calling mesh_3d["cp", "dp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 6], [3, 7]]). 2025-12-04T10:03:43.7424011Z 2025-12-04T10:03:43.7424183Z Example:: 2025-12-04T10:03:43.7424283Z 2025-12-04T10:03:43.7424366Z >>> # xdoctest: +SKIP("no rank") 2025-12-04T10:03:43.7424635Z >>> from torch.distributed.device_mesh import DeviceMesh 2025-12-04T10:03:43.7424883Z >>> 2025-12-04T10:03:43.7425098Z >>> # Initialize a 2D device mesh as (2, 4) to represent the topology 2025-12-04T10:03:43.7425416Z >>> # of cross-host(dim 0), and within-host (dim 1). 2025-12-04T10:03:43.7425759Z >>> mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T10:03:43.7426074Z >>> tp_mesh = mesh_2d["tp"] 2025-12-04T10:03:43.7426268Z >>> dp_mesh = mesh_2d["dp"] 2025-12-04T10:03:43.7426443Z >>> 2025-12-04T10:03:43.7426580Z >>> # Initialize a 3D mesh. 2025-12-04T10:03:43.7426907Z >>> mesh_3d = init_device_mesh(device_type="cuda", (2,2,2), mesh_dim_names=("dp", "pp", "cp")) 2025-12-04T10:03:43.7427476Z >>> # The order of the mesh_dim_names provided deteremines the order of dimensions in the submesh. 
2025-12-04T10:03:43.7427929Z >>> dp_cp_mesh = mesh_3d["dp", "cp"] 2025-12-04T10:03:43.7428215Z >>> cp_dp_mesh = mesh_3d["cp", "dp"] 2025-12-04T10:03:43.7428361Z 2025-12-04T10:03:43.7428819Z Original Error: SyntaxError('positional argument follows keyword argument', ('', 6, 82, 'mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp"))\n', 6, 83)) 2025-12-04T10:03:43.7429348Z 2025-12-04T10:03:43.7429515Z mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T10:03:43.7429863Z ^ 2025-12-04T10:03:44.0245061Z msg = Cannot scrape callname=SavePlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=122. 2025-12-04T10:03:44.0246032Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:44.0246359Z 2025-12-04T10:03:44.0246601Z Abstract class defining the protocol used by save_state_dict to plan the save process. 2025-12-04T10:03:44.0246975Z 2025-12-04T10:03:44.0247233Z SavePlanners are stateful objects that can be used to customize the whole save process. 2025-12-04T10:03:44.0247578Z 2025-12-04T10:03:44.0247817Z SavePlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T10:03:44.0248237Z will be visible to the whole process. 2025-12-04T10:03:44.0248431Z 2025-12-04T10:03:44.0248664Z A planner subclass can expect the following sequence of calls during save_state_dict: 2025-12-04T10:03:44.0248998Z 2025-12-04T10:03:44.0249103Z 1) set_up_planner - called on all ranks. 2025-12-04T10:03:44.0249400Z Signals the start of a checkpoint save. 2025-12-04T10:03:44.0249591Z 2025-12-04T10:03:44.0249705Z 2) create_local_plan - called on all ranks. 2025-12-04T10:03:44.0250098Z Process the state_dict and produces a `SavePlan` that will be sent for global planning. 2025-12-04T10:03:44.0250369Z 2025-12-04T10:03:44.0250507Z 3) create_global_plan - called on the coordinator rank only. 
2025-12-04T10:03:44.0250836Z Takes the SavePlan from all ranks and make any global decision. 2025-12-04T10:03:44.0251050Z 2025-12-04T10:03:44.0251136Z 4) finish_plan - called on all ranks. 2025-12-04T10:03:44.0251429Z This gives each rank a chance to adjust to global planning decisions. 2025-12-04T10:03:44.0251665Z 2025-12-04T10:03:44.0251921Z 5) resolve_data - called multiple times on each rank 2025-12-04T10:03:44.0252264Z Lookups a value on the `state_dict` for the storage layer to write. 2025-12-04T10:03:44.0252487Z 2025-12-04T10:03:44.0252697Z Users are recommended to extend DefaultSavePlanner instead of this interface directly as 2025-12-04T10:03:44.0253120Z most changes can be expressed by changes in a single method. 2025-12-04T10:03:44.0253319Z 2025-12-04T10:03:44.0253411Z There are 3 usual patterns of extension: 2025-12-04T10:03:44.0253579Z 2025-12-04T10:03:44.0253759Z Rewriting state_dict. This is the simplest way to extend the save process as it 2025-12-04T10:03:44.0254307Z doesn't requite understanding the intrincacies of how SavePlan works: 2025-12-04T10:03:44.0254549Z 2025-12-04T10:03:44.0254639Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0254888Z >>> class RenamePlanner(DefaultSavePlanner): 2025-12-04T10:03:44.0255121Z >>> def set_up_planner( 2025-12-04T10:03:44.0255484Z >>> self, 2025-12-04T10:03:44.0255672Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T10:03:44.0255906Z >>> storage_meta: Optional[StorageMeta], 2025-12-04T10:03:44.0256143Z >>> is_coordinator: bool, 2025-12-04T10:03:44.0256339Z >>> ) -> None: 2025-12-04T10:03:44.0256509Z >>> # prefix all keys with `foo_`` 2025-12-04T10:03:44.0256855Z >>> super().set_up_planner({"foo_" + k: v for k, v in state_dict.items()}, storage_meta, is_coordinator) 2025-12-04T10:03:44.0257132Z 2025-12-04T10:03:44.0257364Z Modifying local plan and lookup in tandem. 
This is useful when fine control of how data is persisted 2025-12-04T10:03:44.0257765Z 2025-12-04T10:03:44.0257914Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0258151Z >>> class FP16Planner(DefaultSavePlanner): 2025-12-04T10:03:44.0258414Z >>> def create_local_plan(self): 2025-12-04T10:03:44.0258640Z >>> plan = super().create_local_plan() 2025-12-04T10:03:44.0258857Z >>> for p in plan: 2025-12-04T10:03:44.0259062Z >>> if p.tensor_data is not None: 2025-12-04T10:03:44.0259334Z >>> p.tensor_data.properties.dtype = torch.float16 2025-12-04T10:03:44.0259587Z >>> return plan 2025-12-04T10:03:44.0259747Z >>> 2025-12-04T10:03:44.0259904Z >>> def resolve_data(self, write_item): 2025-12-04T10:03:44.0260164Z >>> item = super().resolve_data(write_item) 2025-12-04T10:03:44.0260514Z >>> return item if write_item.type == WriteItemType.BYTE_IO else item.to(torch.float16) 2025-12-04T10:03:44.0260789Z 2025-12-04T10:03:44.0261022Z Using the global planning step to make central decisions that can't be made individually by each rank 2025-12-04T10:03:44.0261347Z 2025-12-04T10:03:44.0261430Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0261657Z >>> from itertools import zip_longest 2025-12-04T10:03:44.0261876Z >>> from dataclasses import replace 2025-12-04T10:03:44.0262135Z >>> class DDPLoadBalancingPlanner(DefaultSavePlanner): 2025-12-04T10:03:44.0262519Z >>> # This uses the default local plan behavior of having all non-sharded writes in rank 0 2025-12-04T10:03:44.0262875Z >>> # This sample doesn't handle ShardedTensors 2025-12-04T10:03:44.0263135Z >>> def create_global_plan(self, all_plans): 2025-12-04T10:03:44.0263406Z >>> iters = [iter(all_plans[0].items)] * len(all_plans) 2025-12-04T10:03:44.0263652Z >>> items_per_rank = [ 2025-12-04T10:03:44.0263881Z >>> [item for item in items if item is not None] 2025-12-04T10:03:44.0264164Z >>> for items in zip(*zip_longest(*iters), strict=True) 2025-12-04T10:03:44.0264411Z >>> ] 2025-12-04T10:03:44.0264570Z >>> 
all_plans = [ 2025-12-04T10:03:44.0264791Z >>> replace(plan, items=items) 2025-12-04T10:03:44.0265071Z >>> for plan, items in zip(all_plans, items_per_rank, strict=True) 2025-12-04T10:03:44.0265338Z >>> ] 2025-12-04T10:03:44.0265525Z >>> return super().create_global_plan(all_plans) 2025-12-04T10:03:44.0265698Z 2025-12-04T10:03:44.0265969Z Finally, some planners need to save additional metadata in the checkpoint, this is 2025-12-04T10:03:44.0266425Z accomplished by having each rank contribute their data items in the local plan and 2025-12-04T10:03:44.0266771Z the global planner aggregate them: 2025-12-04T10:03:44.0266921Z 2025-12-04T10:03:44.0266999Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0267351Z >>> class SaveExtraDataPlanner(DefaultSavePlanner): 2025-12-04T10:03:44.0267627Z >>> def create_local_plan(self) -> SavePlan: 2025-12-04T10:03:44.0267869Z >>> plan = super().create_local_plan() 2025-12-04T10:03:44.0268215Z >>> return replace(plan, planner_data="per-rank-data") 2025-12-04T10:03:44.0268453Z >>> 2025-12-04T10:03:44.0268743Z >>> def create_global_plan(self, all_plans: List[SavePlan]) -> Tuple[List[SavePlan], Metadata]: 2025-12-04T10:03:44.0269167Z >>> global_plan, metadata = super().create_global_plan(all_plans) 2025-12-04T10:03:44.0269488Z >>> merged_data = [p.planner_data for p in global_plan] 2025-12-04T10:03:44.0269797Z >>> metadata = replace(metadata, planner_data=merged_data) 2025-12-04T10:03:44.0270081Z >>> return global_plan, metadata 2025-12-04T10:03:44.0270228Z 2025-12-04T10:03:44.0270597Z Original Error: IndentationError('expected an indented block after function definition on line 3', ('', 9, 0, '_._ = None\n', 9, -1)) 2025-12-04T10:03:44.0271034Z 2025-12-04T10:03:44.0271095Z _._ = None 2025-12-04T10:03:44.0271229Z ^ 2025-12-04T10:03:44.0271728Z msg = Cannot scrape callname=LoadPlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=305. 
2025-12-04T10:03:44.0272471Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:44.0272722Z 2025-12-04T10:03:44.0272925Z Abstract class defining the protocol used by load_state_dict to plan the load process. 2025-12-04T10:03:44.0273200Z 2025-12-04T10:03:44.0273394Z LoadPlanner are stateful objects that can be used to customize the whole load process. 2025-12-04T10:03:44.0273674Z 2025-12-04T10:03:44.0273866Z LoadPlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T10:03:44.0274214Z will be visible to the whole process. 2025-12-04T10:03:44.0274357Z 2025-12-04T10:03:44.0274553Z A planner subclass can expect the following sequence of calls during load_state_dict: 2025-12-04T10:03:44.0274821Z 2025-12-04T10:03:44.0274903Z 1) set_up_planner - called on all ranks. 2025-12-04T10:03:44.0275147Z Signals the start of loading a checkpoint. 2025-12-04T10:03:44.0275316Z 2025-12-04T10:03:44.0275402Z 2) create_local_plan - called on all ranks. 2025-12-04T10:03:44.0275764Z Process the state_dict and produces a `LoadPlan` that will be sent for global planning. 2025-12-04T10:03:44.0276045Z 2025-12-04T10:03:44.0276170Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T10:03:44.0276507Z Takes the LoadPlan from all ranks and make any global decision. 2025-12-04T10:03:44.0276714Z 2025-12-04T10:03:44.0276822Z 4) load_bytes - called multiple times on each rank 2025-12-04T10:03:44.0277114Z This is called once per non-tensor value in state_dict. 2025-12-04T10:03:44.0277300Z 2025-12-04T10:03:44.0277452Z 5) resolve_tensor and commit_tensor - called multiple times on each rank 2025-12-04T10:03:44.0277809Z They are called in pair for each Tensor value in state_dict. 
2025-12-04T10:03:44.0278009Z 2025-12-04T10:03:44.0278226Z Users are recommended to extend DefaultLoadPlanner instead of this interface directly as 2025-12-04T10:03:44.0278636Z most changes can be expressed by changes in a single method. 2025-12-04T10:03:44.0278841Z 2025-12-04T10:03:44.0278931Z There are two usual patterns of extension: 2025-12-04T10:03:44.0279102Z 2025-12-04T10:03:44.0279280Z Rewriting state_dict. This is the simplest way to extend the load process as it 2025-12-04T10:03:44.0279718Z doesn't requite understanding the intrincacies of how LoadPlan works. We need 2025-12-04T10:03:44.0280171Z to keep a reference to the original state_dict as load happens in place so 2025-12-04T10:03:44.0280496Z we need to be able to perform it in place 2025-12-04T10:03:44.0280657Z 2025-12-04T10:03:44.0280737Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0280983Z >>> class RenamePlanner(DefaultLoadPlanner): 2025-12-04T10:03:44.0281210Z >>> def set_up_planner( 2025-12-04T10:03:44.0281402Z >>> self, 2025-12-04T10:03:44.0281575Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T10:03:44.0281788Z >>> metadata: Metadata, 2025-12-04T10:03:44.0282046Z >>> is_coordinator: bool, 2025-12-04T10:03:44.0282244Z >>> ) -> None: 2025-12-04T10:03:44.0282427Z >>> self.original_state_dict = state_dict 2025-12-04T10:03:44.0282728Z >>> state_dict = {"foo_" + k: v for k, v in state_dict.items()} 2025-12-04T10:03:44.0282985Z >>> 2025-12-04T10:03:44.0283138Z >>> if self.flatten_sharded_tensors: 2025-12-04T10:03:44.0283399Z >>> state_dict = _flatten_sharded_tensors(state_dict) 2025-12-04T10:03:44.0283641Z >>> 2025-12-04T10:03:44.0283796Z >>> if self.flatten_state_dict: 2025-12-04T10:03:44.0284064Z >>> state_dict, self.mappings = flatten_state_dict(state_dict) 2025-12-04T10:03:44.0284330Z >>> 2025-12-04T10:03:44.0284484Z >>> self.state_dict = state_dict 2025-12-04T10:03:44.0284701Z >>> self.metadata = metadata 2025-12-04T10:03:44.0284928Z >>> self.is_coordinator = is_coordinator 
2025-12-04T10:03:44.0285141Z >>> 2025-12-04T10:03:44.0285297Z >>> def load_bytes(self, read_item, value): 2025-12-04T10:03:44.0285633Z >>> # Remove the "foo_" prefix 2025-12-04T10:03:44.0286292Z >>> self.original_state_dict[read_item.dest_index.fqn[4:]] = torch.load(value, weights_only=False) 2025-12-04T10:03:44.0286621Z 2025-12-04T10:03:44.0286624Z 2025-12-04T10:03:44.0286807Z Modifying resolve_tensor and commit_tensor to handle load time transformation. 2025-12-04T10:03:44.0287065Z 2025-12-04T10:03:44.0287143Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:44.0287397Z >>> class MetaModelMaterialize(DefaultSavePlanner): 2025-12-04T10:03:44.0287659Z >>> def resolve_tensor(self, read_item): 2025-12-04T10:03:44.0287890Z >>> tensor = super().resolve_tensor(read_item) 2025-12-04T10:03:44.0288154Z >>> return torch.empty_like(tensor, device="cpu") 2025-12-04T10:03:44.0288386Z >>> 2025-12-04T10:03:44.0288542Z >>> def commit_tensor(self, read_item, tensor): 2025-12-04T10:03:44.0288809Z >>> self.state_dict[read_item.dest_index.fqn] = tensor 2025-12-04T10:03:44.0288994Z 2025-12-04T10:03:44.0289362Z Original Error: IndentationError('expected an indented block after function definition on line 22', ('', 23, 0, '_._ = None\n', 23, -1)) 2025-12-04T10:03:44.0289805Z 2025-12-04T10:03:44.0289868Z _._ = None 2025-12-04T10:03:44.0290001Z ^ 2025-12-04T10:03:44.3362642Z msg = Cannot scrape callname=FullStateDictConfig in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py line=295. 2025-12-04T10:03:44.3363535Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:44.3363857Z 2025-12-04T10:03:44.3364052Z ``FullStateDictConfig`` is a config class meant to be used with 2025-12-04T10:03:44.3364496Z ``StateDictType.FULL_STATE_DICT``. 
We recommend enabling both 2025-12-04T10:03:44.3364939Z ``offload_to_cpu=True`` and ``rank0_only=True`` when saving full state 2025-12-04T10:03:44.3365402Z dicts to save GPU memory and CPU memory, respectively. This config class 2025-12-04T10:03:44.3365856Z is meant to be used via the :func:`state_dict_type` context manager as 2025-12-04T10:03:44.3366186Z follows: 2025-12-04T10:03:44.3366299Z 2025-12-04T10:03:44.3366410Z >>> # xdoctest: +SKIP("undefined variables") 2025-12-04T10:03:44.3366825Z >>> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 2025-12-04T10:03:44.3367222Z >>> fsdp = FSDP(model, auto_wrap_policy=...) 2025-12-04T10:03:44.3367959Z >>> cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True) 2025-12-04T10:03:44.3368455Z >>> with FSDP.state_dict_type(fsdp, StateDictType.FULL_STATE_DICT, cfg): 2025-12-04T10:03:44.3368855Z >>> state = fsdp.state_dict() 2025-12-04T10:03:44.3369209Z >>> # `state` will be empty on non rank 0 and contain CPU tensors on rank 0. 2025-12-04T10:03:44.3369684Z >>> # To reload checkpoint for inference, finetuning, transfer learning, etc: 2025-12-04T10:03:44.3370132Z >>> model = model_fn() # Initialize model in preparation for wrapping with FSDP 2025-12-04T10:03:44.3370526Z >>> if dist.get_rank() == 0: 2025-12-04T10:03:44.3370803Z >>> # Load checkpoint only on rank 0 to avoid memory redundancy 2025-12-04T10:03:44.3371104Z >>> state_dict = torch.load("my_checkpoint.pt") 2025-12-04T10:03:44.3371362Z >>> model.load_state_dict(state_dict) 2025-12-04T10:03:44.3371673Z >>> # All ranks initialize FSDP module as usual. `sync_module_states` argument 2025-12-04T10:03:44.3372074Z >>> # communicates loaded checkpoint states from rank 0 to rest of the world. 2025-12-04T10:03:44.3372375Z >>> fsdp = FSDP( 2025-12-04T10:03:44.3372532Z ... model, 2025-12-04T10:03:44.3372716Z ... device_id=torch.cuda.current_device(), 2025-12-04T10:03:44.3372951Z ... auto_wrap_policy=..., 2025-12-04T10:03:44.3373152Z ... 
sync_module_states=True, 2025-12-04T10:03:44.3373343Z ... ) 2025-12-04T10:03:44.3373571Z >>> # After this point, all ranks have FSDP model with loaded checkpoint. 2025-12-04T10:03:44.3373794Z 2025-12-04T10:03:44.3373860Z Attributes: 2025-12-04T10:03:44.3374218Z rank0_only (bool): If ``True``, then only rank 0 saves the full state 2025-12-04T10:03:44.3374591Z dict, and nonzero ranks save an empty dict. If ``False``, then all 2025-12-04T10:03:44.3374905Z ranks save the full state dict. (Default: ``False``) 2025-12-04T10:03:44.3375082Z 2025-12-04T10:03:44.3375421Z Original Error: IndentationError("expected an indented block after 'if' statement on line 10", ('', 11, 1, '_._ = None\n', 11, 2)) 2025-12-04T10:03:44.3375823Z 2025-12-04T10:03:44.3375881Z _._ = None 2025-12-04T10:03:44.3376018Z ^ 2025-12-04T10:03:46.0286735Z msg = Cannot scrape callname=register_parametrization in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrize.py line=437. 2025-12-04T10:03:46.0287610Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:46.0288082Z Register a parametrization to a tensor in a module. 2025-12-04T10:03:46.0288336Z 2025-12-04T10:03:46.0288589Z Assume that ``tensor_name="weight"`` for simplicity. When accessing ``module.weight``, 2025-12-04T10:03:46.0289158Z the module will return the parametrized version ``parametrization(module.weight)``. 2025-12-04T10:03:46.0289720Z If the original tensor requires a gradient, the backward pass will differentiate 2025-12-04T10:03:46.0290290Z through :attr:`parametrization`, and the optimizer will update the tensor accordingly. 2025-12-04T10:03:46.0290638Z 2025-12-04T10:03:46.0290900Z The first time that a module registers a parametrization, this function will add an attribute 2025-12-04T10:03:46.0291404Z ``parametrizations`` to the module of type :class:`~ParametrizationList`. 
2025-12-04T10:03:46.0291653Z 2025-12-04T10:03:46.0291826Z The list of parametrizations on the tensor ``weight`` will be accessible under 2025-12-04T10:03:46.0292182Z ``module.parametrizations.weight``. 2025-12-04T10:03:46.0292346Z 2025-12-04T10:03:46.0292440Z The original tensor will be accessible under 2025-12-04T10:03:46.0292722Z ``module.parametrizations.weight.original``. 2025-12-04T10:03:46.0292902Z 2025-12-04T10:03:46.0293085Z Parametrizations may be concatenated by registering several parametrizations 2025-12-04T10:03:46.0293423Z on the same attribute. 2025-12-04T10:03:46.0293546Z 2025-12-04T10:03:46.0293708Z The training mode of a registered parametrization is updated on registration 2025-12-04T10:03:46.0294292Z to match the training mode of the host module 2025-12-04T10:03:46.0294476Z 2025-12-04T10:03:46.0294694Z Parametrized parameters and buffers have an inbuilt caching system that can be activated 2025-12-04T10:03:46.0295059Z using the context manager :func:`cached`. 2025-12-04T10:03:46.0295232Z 2025-12-04T10:03:46.0295398Z A :attr:`parametrization` may optionally implement a method with signature 2025-12-04T10:03:46.0295643Z 2025-12-04T10:03:46.0295733Z .. code-block:: python 2025-12-04T10:03:46.0295851Z 2025-12-04T10:03:46.0296100Z def right_inverse(self, X: Tensor) -> Union[Tensor, Sequence[Tensor]] 2025-12-04T10:03:46.0296336Z 2025-12-04T10:03:46.0296522Z This method is called on the unparametrized tensor when the first parametrization 2025-12-04T10:03:46.0296930Z is registered to compute the initial value of the original tensor. 2025-12-04T10:03:46.0297356Z If this method is not implemented, the original tensor will be just the unparametrized tensor. 2025-12-04T10:03:46.0297651Z 2025-12-04T10:03:46.0297874Z If all the parametrizations registered on a tensor implement `right_inverse` it is possible 2025-12-04T10:03:46.0298354Z to initialize a parametrized tensor by assigning to it, as shown in the example below. 
2025-12-04T10:03:46.0298626Z 2025-12-04T10:03:46.0298779Z It is possible for the first parametrization to depend on several inputs. 2025-12-04T10:03:46.0299175Z This may be implemented returning a tuple of tensors from ``right_inverse`` 2025-12-04T10:03:46.0299582Z (see the example implementation of a ``RankOne`` parametrization below). 2025-12-04T10:03:46.0299887Z 2025-12-04T10:03:46.0300175Z In this case, the unconstrained tensors are also located under ``module.parametrizations.weight`` 2025-12-04T10:03:46.0300583Z with names ``original0``, ``original1``,... 2025-12-04T10:03:46.0300757Z 2025-12-04T10:03:46.0300822Z .. note:: 2025-12-04T10:03:46.0300914Z 2025-12-04T10:03:46.0301107Z If unsafe=False (default) both the forward and right_inverse methods will be called 2025-12-04T10:03:46.0301468Z once to perform a number of consistency checks. 2025-12-04T10:03:46.0301821Z If unsafe=True, then right_inverse will be called if the tensor is not parametrized, 2025-12-04T10:03:46.0302172Z and nothing will be called otherwise. 2025-12-04T10:03:46.0302327Z 2025-12-04T10:03:46.0302383Z .. note:: 2025-12-04T10:03:46.0302474Z 2025-12-04T10:03:46.0302613Z In most situations, ``right_inverse`` will be a function such that 2025-12-04T10:03:46.0302913Z ``forward(right_inverse(X)) == X`` (see 2025-12-04T10:03:46.0303285Z `right inverse `_). 2025-12-04T10:03:46.0303724Z Sometimes, when the parametrization is not surjective, it may be reasonable 2025-12-04T10:03:46.0304051Z to relax this. 2025-12-04T10:03:46.0304162Z 2025-12-04T10:03:46.0304228Z .. warning:: 2025-12-04T10:03:46.0304322Z 2025-12-04T10:03:46.0304519Z If a parametrization depends on several inputs, :func:`~register_parametrization` 2025-12-04T10:03:46.0304955Z will register a number of new parameters. If such parametrization is registered 2025-12-04T10:03:46.0305397Z after the optimizer is created, these new parameters will need to be added manually 2025-12-04T10:03:46.0305802Z to the optimizer. 
See :meth:`torch.Optimizer.add_param_group`. 2025-12-04T10:03:46.0306010Z 2025-12-04T10:03:46.0306067Z Args: 2025-12-04T10:03:46.0306297Z module (nn.Module): module on which to register the parametrization 2025-12-04T10:03:46.0306687Z tensor_name (str): name of the parameter or buffer on which to register 2025-12-04T10:03:46.0306986Z the parametrization 2025-12-04T10:03:46.0307369Z parametrization (nn.Module): the parametrization to register 2025-12-04T10:03:46.0307650Z Keyword args: 2025-12-04T10:03:46.0307962Z unsafe (bool): a boolean flag that denotes whether the parametrization 2025-12-04T10:03:46.0308329Z may change the dtype and shape of the tensor. Default: `False` 2025-12-04T10:03:46.0308727Z Warning: the parametrization is not checked for consistency upon registration. 2025-12-04T10:03:46.0309089Z Enable this flag at your own risk. 2025-12-04T10:03:46.0309241Z 2025-12-04T10:03:46.0309305Z Raises: 2025-12-04T10:03:46.0309586Z ValueError: if the module does not have a parameter or a buffer named :attr:`tensor_name` 2025-12-04T10:03:46.0309926Z 2025-12-04T10:03:46.0309987Z Examples: 2025-12-04T10:03:46.0310189Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_LAPACK) 2025-12-04T10:03:46.0310446Z >>> import torch 2025-12-04T10:03:46.0310631Z >>> import torch.nn as nn 2025-12-04T10:03:46.0310870Z >>> import torch.nn.utils.parametrize as P 2025-12-04T10:03:46.0311110Z >>> 2025-12-04T10:03:46.0311266Z >>> class Symmetric(nn.Module): 2025-12-04T10:03:46.0311489Z >>> def forward(self, X): 2025-12-04T10:03:46.0311756Z >>> return X.triu() + X.triu(1).T # Return a symmetric matrix 2025-12-04T10:03:46.0312016Z >>> 2025-12-04T10:03:46.0312171Z >>> def right_inverse(self, A): 2025-12-04T10:03:46.0312386Z >>> return A.triu() 2025-12-04T10:03:46.0312567Z >>> 2025-12-04T10:03:46.0312714Z >>> m = nn.Linear(5, 5) 2025-12-04T10:03:46.0312960Z >>> P.register_parametrization(m, "weight", Symmetric()) 2025-12-04T10:03:46.0313320Z >>> print(torch.allclose(m.weight, m.weight.T)) # 
m.weight is now symmetric 2025-12-04T10:03:46.0313701Z True 2025-12-04T10:03:46.0313862Z >>> A = torch.rand(5, 5) 2025-12-04T10:03:46.0314065Z >>> A = A + A.T # A is now symmetric 2025-12-04T10:03:46.0314341Z >>> m.weight = A # Initialize the weight to be the symmetric matrix A 2025-12-04T10:03:46.0314639Z >>> print(torch.allclose(m.weight, A)) 2025-12-04T10:03:46.0314859Z True 2025-12-04T10:03:46.0314946Z 2025-12-04T10:03:46.0315019Z >>> class RankOne(nn.Module): 2025-12-04T10:03:46.0315233Z >>> def forward(self, x, y): 2025-12-04T10:03:46.0315474Z >>> # Form a rank 1 matrix multiplying two vectors 2025-12-04T10:03:46.0315748Z >>> return x.unsqueeze(-1) @ y.unsqueeze(-2) 2025-12-04T10:03:46.0315969Z >>> 2025-12-04T10:03:46.0316127Z >>> def right_inverse(self, Z): 2025-12-04T10:03:46.0316353Z >>> # Project Z onto the rank 1 matrices 2025-12-04T10:03:46.0316612Z >>> U, S, Vh = torch.linalg.svd(Z, full_matrices=False) 2025-12-04T10:03:46.0316875Z >>> # Return rescaled singular vectors 2025-12-04T10:03:46.0317111Z >>> s0_sqrt = S[0].sqrt().unsqueeze(-1) 2025-12-04T10:03:46.0317366Z >>> return U[..., :, 0] * s0_sqrt, Vh[..., 0, :] * s0_sqrt 2025-12-04T10:03:46.0317598Z >>> 2025-12-04T10:03:46.0317788Z >>> linear_rank_one = P.register_parametrization( 2025-12-04T10:03:46.0318046Z ... nn.Linear(4, 4), "weight", RankOne() 2025-12-04T10:03:46.0318272Z ... ) 2025-12-04T10:03:46.0318502Z >>> print(torch.linalg.matrix_rank(linear_rank_one.weight).item()) 2025-12-04T10:03:46.0318773Z 1 2025-12-04T10:03:46.0318854Z 2025-12-04T10:03:46.0318907Z 2025-12-04T10:03:46.0319350Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 3, 0, '_._ = None\n', 3, -1)) 2025-12-04T10:03:46.0319781Z 2025-12-04T10:03:46.0319847Z _._ = None 2025-12-04T10:03:46.0319978Z ^ 2025-12-04T10:03:46.2299307Z msg = Cannot scrape callname=ReduceLROnPlateau in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py line=1586. 
2025-12-04T10:03:46.2300158Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:46.2300637Z Reduce learning rate when a metric has stopped improving. 2025-12-04T10:03:46.2300897Z 2025-12-04T10:03:46.2301361Z Models often benefit from reducing the learning rate by a factor 2025-12-04T10:03:46.2301830Z of 2-10 once learning stagnates. This scheduler reads a metrics 2025-12-04T10:03:46.2302268Z quantity and if no improvement is seen for a 'patience' number 2025-12-04T10:03:46.2302643Z of epochs, the learning rate is reduced. 2025-12-04T10:03:46.2302843Z 2025-12-04T10:03:46.2302916Z Args: 2025-12-04T10:03:46.2303152Z optimizer (Optimizer): Wrapped optimizer. 2025-12-04T10:03:46.2303496Z mode (str): One of `min`, `max`. In `min` mode, lr will 2025-12-04T10:03:46.2303992Z be reduced when the quantity monitored has stopped 2025-12-04T10:03:46.2304357Z decreasing; in `max` mode it will be reduced when the 2025-12-04T10:03:46.2304747Z quantity monitored has stopped increasing. Default: 'min'. 2025-12-04T10:03:46.2305155Z factor (float): Factor by which the learning rate will be 2025-12-04T10:03:46.2305517Z reduced. new_lr = lr * factor. Default: 0.1. 2025-12-04T10:03:46.2305928Z patience (int): The number of allowed epochs with no improvement after 2025-12-04T10:03:46.2306320Z which the learning rate will be reduced. 2025-12-04T10:03:46.2306723Z For example, consider the case of having no patience (`patience = 0`). 2025-12-04T10:03:46.2307434Z In the first epoch, a baseline is established and is always considered good as there's no previous baseline. 2025-12-04T10:03:46.2308000Z In the second epoch, if the performance is worse than the baseline, 2025-12-04T10:03:46.2308613Z we have what is considered an intolerable epoch. 2025-12-04T10:03:46.2309075Z Since the count of intolerable epochs (1) is greater than the patience level (0), 2025-12-04T10:03:46.2309546Z the learning rate is reduced at the end of this epoch. 
2025-12-04T10:03:46.2310049Z From the third epoch onwards, the learning rate continues to be reduced at the end of each epoch 2025-12-04T10:03:46.2310624Z if the performance is worse than the baseline. If the performance improves or remains the same, 2025-12-04T10:03:46.2310995Z the learning rate is not adjusted. 2025-12-04T10:03:46.2311221Z Default: 10. 2025-12-04T10:03:46.2311467Z threshold (float): Threshold for measuring the new optimum, 2025-12-04T10:03:46.2311794Z to only focus on significant changes. Default: 1e-4. 2025-12-04T10:03:46.2312099Z threshold_mode (str): One of `rel`, `abs`. In `rel` mode, 2025-12-04T10:03:46.2312400Z dynamic_threshold = best * ( 1 + threshold ) in 'max' 2025-12-04T10:03:46.2312687Z mode or best * ( 1 - threshold ) in `min` mode. 2025-12-04T10:03:46.2312969Z In `abs` mode, dynamic_threshold = best + threshold in 2025-12-04T10:03:46.2313268Z `max` mode or best - threshold in `min` mode. Default: 'rel'. 2025-12-04T10:03:46.2313587Z cooldown (int): Number of epochs to wait before resuming 2025-12-04T10:03:46.2313894Z normal operation after lr has been reduced. Default: 0. 2025-12-04T10:03:46.2314197Z min_lr (float or list): A scalar or a list of scalars. A 2025-12-04T10:03:46.2314481Z lower bound on the learning rate of all param groups 2025-12-04T10:03:46.2314753Z or each group respectively. Default: 0. 2025-12-04T10:03:46.2315039Z eps (float): Minimal decay applied to lr. If the difference 2025-12-04T10:03:46.2315349Z between new and old lr is smaller than eps, the update is 2025-12-04T10:03:46.2315638Z ignored. Default: 1e-8. 2025-12-04T10:03:46.2315773Z 2025-12-04T10:03:46.2315839Z Example: 2025-12-04T10:03:46.2315995Z >>> # xdoctest: +SKIP 2025-12-04T10:03:46.2316276Z >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) 2025-12-04T10:03:46.2316622Z >>> scheduler = ReduceLROnPlateau(optimizer, "min") 2025-12-04T10:03:46.2316932Z >>> for epoch in range(10): 2025-12-04T10:03:46.2317139Z >>> train(...) 
2025-12-04T10:03:46.2317328Z >>> val_loss = validate(...) 2025-12-04T10:03:46.2317581Z >>> # Note that step should be called after validate() 2025-12-04T10:03:46.2317826Z >>> scheduler.step(val_loss) 2025-12-04T10:03:46.2317974Z 2025-12-04T10:03:46.2318135Z .. image:: ../scripts/lr_scheduler_images/ReduceLROnPlateau.png 2025-12-04T10:03:46.2318407Z 2025-12-04T10:03:46.2318787Z Original Error: IndentationError('unexpected indent', ('', 8, 4, ' scheduler.step(val_loss)\n', 8, -1)) 2025-12-04T10:03:46.2319192Z 2025-12-04T10:03:46.2319273Z scheduler.step(val_loss) 2025-12-04T10:03:46.2319455Z ^ 2025-12-04T10:03:49.0389773Z running 894 test(s) 2025-12-04T10:03:49.0396017Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::typename:0, line 1111 <- wrt source file 2025-12-04T10:03:49.0404502Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::typename:0 2025-12-04T10:03:49.0405310Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_tensor:0, line 1142 <- wrt source file 2025-12-04T10:03:49.0409422Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_tensor:0 2025-12-04T10:03:49.0410326Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_storage:0, line 1157 <- wrt source file 2025-12-04T10:03:49.0417783Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::is_storage:0 2025-12-04T10:03:49.0418602Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_device:0, line 1247 <- wrt source file 2025-12-04T10:03:49.0420685Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_device:0 2025-12-04T10:03:49.0421527Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_tensor_type:0, line 1296 <- wrt source file 2025-12-04T10:03:49.0422700Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_tensor_type:0 2025-12-04T10:03:49.0423527Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_dtype:0, line 1333 <- wrt source file 2025-12-04T10:03:49.0426294Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::set_default_dtype:0 2025-12-04T10:03:49.0427582Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::use_deterministic_algorithms:0, line 1497 <- wrt source file 2025-12-04T10:03:49.0428756Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::use_deterministic_algorithms:0 2025-12-04T10:03:49.0429566Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::compile:0, line 2655 <- wrt source file 2025-12-04T10:03:49.0430306Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::compile:0 2025-12-04T10:03:49.0431344Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::_is_device_backend_autoload_enabled:0, line 2963 <- wrt source file 2025-12-04T10:03:49.0432295Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/__init__.py::_is_device_backend_autoload_enabled:0 2025-12-04T10:03:49.0433314Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::Generator:0, line 15 <- wrt source file 2025-12-04T10:03:49.0434506Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::Generator:0 2025-12-04T10:03:49.0435611Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::_LinAlgError:0, line 5 <- wrt source file 2025-12-04T10:03:49.0436581Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_C.cpython-310-x86_64-linux-gnu.so::_LinAlgError:0 2025-12-04T10:03:49.0437401Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::custom_op:0, line 55 <- wrt source file 2025-12-04T10:03:49.0438297Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::custom_op:0 2025-12-04T10:03:49.0439044Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl:0, line 138 <- wrt source file 2025-12-04T10:03:49.0439780Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl:0 2025-12-04T10:03:49.0440522Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl_abstract:0, line 208 <- wrt source file 2025-12-04T10:03:49.0973195Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_custom_ops.py::impl_abstract:0 2025-12-04T10:03:49.0974476Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_namedtensor_internals.py::update_names:0, line 118 <- wrt source file 2025-12-04T10:03:49.0975603Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_namedtensor_internals.py::update_names:0 2025-12-04T10:03:49.0976850Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_hook:0, line 681 <- wrt source file 2025-12-04T10:03:49.0989513Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_hook:0 2025-12-04T10:03:49.0990622Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_post_accumulate_grad_hook:0, line 738 <- wrt source file 2025-12-04T10:03:49.1010273Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.register_post_accumulate_grad_hook:0 2025-12-04T10:03:49.1011378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.refine_names:0, line 1374 <- wrt source file 2025-12-04T10:03:49.1068893Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.refine_names:0 2025-12-04T10:03:49.1069920Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.align_to:0, line 1419 <- wrt source file 2025-12-04T10:03:49.1074062Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.align_to:0 2025-12-04T10:03:49.1074856Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.rename:0, line 1492 <- wrt source file 2025-12-04T10:03:49.1082144Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.rename:0 2025-12-04T10:03:49.1083163Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.to_sparse_coo:0, line 1522 <- wrt source file 2025-12-04T10:03:49.1092502Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.to_sparse_coo:0 2025-12-04T10:03:49.1093602Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.dim_order:0, line 1554 <- wrt source file 2025-12-04T10:03:49.1111543Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py::Tensor.dim_order:0 2025-12-04T10:03:49.1112751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor_str.py::set_printoptions:0, line 53 <- wrt source file 2025-12-04T10:03:49.1128781Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor_str.py::set_printoptions:0 2025-12-04T10:03:49.1129868Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_tensors:0, line 64 <- wrt source file 2025-12-04T10:03:49.1136347Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_tensors:0 2025-12-04T10:03:49.1137578Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_shapes:0, line 92 <- 
wrt source file 2025-12-04T10:03:49.1140505Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::broadcast_shapes:0 2025-12-04T10:03:49.1141537Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::split:0, line 144 <- wrt source file 2025-12-04T10:03:49.1154668Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::split:0 2025-12-04T10:03:49.1155893Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::einsum:0, line 258 <- wrt source file 2025-12-04T10:03:49.1173609Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::einsum:0 2025-12-04T10:03:49.1174563Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::meshgrid:0, line 450 <- wrt source file 2025-12-04T10:03:49.1214645Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::meshgrid:0 2025-12-04T10:03:49.1215620Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_impl:0, line 835 <- wrt source file 2025-12-04T10:03:49.1261362Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_impl:0 2025-12-04T10:03:49.1262409Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_consecutive_impl:0, line 992 <- wrt source file 2025-12-04T10:03:49.1274616Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_unique_consecutive_impl:0 2025-12-04T10:03:49.1275640Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::tensordot:0, line 1267 <- wrt source file 2025-12-04T10:03:49.1286023Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::tensordot:0 2025-12-04T10:03:49.1287027Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cartesian_prod:0, line 
1351 <- wrt source file 2025-12-04T10:03:49.1294086Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cartesian_prod:0 2025-12-04T10:03:49.1295054Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::block_diag:0, line 1385 <- wrt source file 2025-12-04T10:03:49.1304946Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::block_diag:0 2025-12-04T10:03:49.1305895Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cdist:0, line 1441 <- wrt source file 2025-12-04T10:03:49.1320555Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::cdist:0 2025-12-04T10:03:49.1321509Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_1d:0, line 1482 <- wrt source file 2025-12-04T10:03:49.1339495Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_1d:0 2025-12-04T10:03:49.1340564Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_2d:0, line 1520 <- wrt source file 2025-12-04T10:03:49.1359960Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_2d:0 2025-12-04T10:03:49.1360909Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_3d:0, line 1560 <- wrt source file 2025-12-04T10:03:49.1384492Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::atleast_3d:0 2025-12-04T10:03:49.1385570Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::norm:0, line 1735 <- wrt source file 2025-12-04T10:03:49.1421524Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::norm:0 2025-12-04T10:03:49.1422503Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::unravel_index:0, line 1905 <- wrt source 
file 2025-12-04T10:03:49.1453169Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::unravel_index:0 2025-12-04T10:03:49.1454154Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::chain_matmul:0, line 2005 <- wrt source file 2025-12-04T10:03:49.1455141Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::chain_matmul:0 2025-12-04T10:03:49.1456610Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_lu_impl:0, line 2106 <- wrt source file 2025-12-04T10:03:49.1458224Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/functional.py::_lu_impl:0 2025-12-04T10:03:49.1459419Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::list:0, line 477 <- wrt source file 2025-12-04T10:03:49.1460658Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::list:0 2025-12-04T10:03:49.1461661Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::help:0, line 537 <- wrt source file 2025-12-04T10:03:49.1462424Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::help:0 2025-12-04T10:03:49.1463092Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load:0, line 628 <- wrt source file 2025-12-04T10:03:49.1463778Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load:0 2025-12-04T10:03:49.1464470Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::_load_local:0, line 676 <- wrt source file 2025-12-04T10:03:49.1465437Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::_load_local:0 2025-12-04T10:03:49.1466200Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::download_url_to_file:0, line 711 <- wrt source file 2025-12-04T10:03:49.1467001Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::download_url_to_file:0 2025-12-04T10:03:49.1467909Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load_state_dict_from_url:0, line 852 <- wrt source file 2025-12-04T10:03:49.1468706Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/hub.py::load_state_dict_from_url:0 2025-12-04T10:03:49.1469502Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.define:0, line 145 <- wrt source file 2025-12-04T10:03:49.1470286Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.define:0 2025-12-04T10:03:49.1471197Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library._impl_with_aoti_compile:0, line 239 <- wrt source file 2025-12-04T10:03:49.1475016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library._impl_with_aoti_compile:0 2025-12-04T10:03:49.1475867Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.impl:0, line 300 <- wrt source file 2025-12-04T10:03:49.1479840Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::Library.impl:0 2025-12-04T10:03:49.1480714Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::define:0, line 521 <- wrt source file 2025-12-04T10:03:49.1491739Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::define:0 2025-12-04T10:03:49.1492645Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::impl:0, line 627 <- wrt source file 2025-12-04T10:03:49.1507341Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::impl:0 2025-12-04T10:03:49.1508359Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_kernel:0, line 809 <- wrt source file 2025-12-04T10:03:49.1509289Z 
* SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_kernel:0 2025-12-04T10:03:49.1510073Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autocast:0, line 877 <- wrt source file 2025-12-04T10:03:49.1511006Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autocast:0 2025-12-04T10:03:49.1511794Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autograd:0, line 1164 <- wrt source file 2025-12-04T10:03:49.1665772Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_autograd:0 2025-12-04T10:03:49.1666806Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_torch_dispatch:0, line 1280 <- wrt source file 2025-12-04T10:03:49.1737653Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_torch_dispatch:0 2025-12-04T10:03:49.1738699Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_vmap:0, line 1369 <- wrt source file 2025-12-04T10:03:49.1880938Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::register_vmap:0 2025-12-04T10:03:49.1881915Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::opcheck:0, line 1694 <- wrt source file 2025-12-04T10:03:49.1882847Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py::opcheck:0 2025-12-04T10:03:49.1883813Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_ignored_functions:0, line 117 <- wrt source file 2025-12-04T10:03:49.1888041Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_ignored_functions:0 2025-12-04T10:03:49.1889102Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_testing_overrides:0, line 435 
<- wrt source file 2025-12-04T10:03:49.1919897Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::get_testing_overrides:0 2025-12-04T10:03:49.1920959Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::wrap_torch_function:0, line 1589 <- wrt source file 2025-12-04T10:03:49.1924191Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::wrap_torch_function:0 2025-12-04T10:03:49.1926017Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::handle_torch_function:0, line 1725 <- wrt source file 2025-12-04T10:03:49.1927785Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::handle_torch_function:0 2025-12-04T10:03:49.1928859Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_method_or_property:0, line 1974 <- wrt source file 2025-12-04T10:03:49.1956052Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_method_or_property:0 2025-12-04T10:03:49.1957135Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_like:0, line 1993 <- wrt source file 2025-12-04T10:03:49.1963860Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/overrides.py::is_tensor_like:0 2025-12-04T10:03:49.1965004Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/quasirandom.py::SobolEngine:0, line 39 <- wrt source file 2025-12-04T10:03:49.1966214Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/quasirandom.py::SobolEngine:0 2025-12-04T10:03:49.1967277Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::add_safe_globals:0, line 300 <- wrt source file 2025-12-04T10:03:49.1968470Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::add_safe_globals:0 2025-12-04T10:03:49.1969935Z * DOCTEST 
: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::safe_globals:0, line 325 <- wrt source file 2025-12-04T10:03:49.1970980Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::safe_globals:0 2025-12-04T10:03:49.1971877Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::skip_data:0, line 401 <- wrt source file 2025-12-04T10:03:49.1972682Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::skip_data:0 2025-12-04T10:03:49.1973479Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::register_package:0, line 473 <- wrt source file 2025-12-04T10:03:49.1975713Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::register_package:0 2025-12-04T10:03:49.1976587Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::save:0, line 960 <- wrt source file 2025-12-04T10:03:49.1977376Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::save:0 2025-12-04T10:03:49.1978133Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::load:0, line 1379 <- wrt source file 2025-12-04T10:03:49.1982151Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/serialization.py::load:0 2025-12-04T10:03:49.1983326Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/torch_version.py::TorchVersion:0, line 19 <- wrt source file 2025-12-04T10:03:49.1984363Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/torch_version.py::TorchVersion:0 2025-12-04T10:03:49.1985458Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_mode_options:0, line 349 <- wrt source file 2025-12-04T10:03:49.1986721Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_mode_options:0 2025-12-04T10:03:49.1987993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_options:0, line 388 <- wrt source file 2025-12-04T10:03:49.2000385Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/__init__.py::list_options:0 2025-12-04T10:03:49.2001317Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims_common/__init__.py::compute_required_storage_length:0, line 1911 <- wrt source file 2025-12-04T10:03:49.2006618Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims_common/__init__.py::compute_required_storage_length:0 2025-12-04T10:03:49.2007895Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::current_accelerator:0, line 117 <- wrt source file 2025-12-04T10:03:49.3009926Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::current_accelerator:0 2025-12-04T10:03:49.3011415Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::get_device_capability:0, line 171 <- wrt source file 2025-12-04T10:03:49.3012479Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::get_device_capability:0 2025-12-04T10:03:49.3013376Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::device_index:0, line 276 <- wrt source file 2025-12-04T10:03:49.3014228Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py::device_index:0 2025-12-04T10:03:49.3015352Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::allow_in_graph:0, line 130 <- wrt source file 2025-12-04T10:03:49.3016192Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::allow_in_graph:0 2025-12-04T10:03:49.3017034Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::substitute_in_graph:0, line 186 <- wrt source file 2025-12-04T10:03:49.6533481Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::substitute_in_graph:0 2025-12-04T10:03:49.6534642Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::wrap_numpy:0, line 416 <- wrt source file 2025-12-04T10:03:49.6535733Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::wrap_numpy:0 2025-12-04T10:03:49.6536996Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_compiling:0, line 448 <- wrt source file 2025-12-04T10:03:49.6538118Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_compiling:0 2025-12-04T10:03:49.6546879Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_dynamo_compiling:0, line 469 <- wrt source file 2025-12-04T10:03:49.6548418Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_dynamo_compiling:0 2025-12-04T10:03:49.6549473Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_exporting:0, line 487 <- wrt source file 2025-12-04T10:03:49.6550558Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::is_exporting:0 2025-12-04T10:03:49.6551689Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::save_cache_artifacts:0, line 502 <- wrt source file 2025-12-04T10:03:49.6552629Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::save_cache_artifacts:0 2025-12-04T10:03:49.6554015Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::load_cache_artifacts:0, line 522 <- wrt source file 2025-12-04T10:03:49.6555418Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/compiler/__init__.py::load_cache_artifacts:0 2025-12-04T10:03:49.6556280Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py::_compile_kernel:0, line 1788 <- wrt source file 2025-12-04T10:03:49.6557269Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py::_compile_kernel:0 2025-12-04T10:03:49.6558348Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::save:0, line 349 <- wrt source file 2025-12-04T10:03:49.6559201Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::save:0 2025-12-04T10:03:49.6560174Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::load:0, line 422 <- wrt source file 2025-12-04T10:03:49.6561297Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::load:0 2025-12-04T10:03:49.6562105Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::register_dataclass:0, line 581 <- wrt source file 2025-12-04T10:03:49.6562951Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/__init__.py::register_dataclass:0 2025-12-04T10:03:49.6563886Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.then:0, line 152 <- wrt source file 2025-12-04T10:03:49.6564783Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.then:0 2025-12-04T10:03:49.6565620Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.add_done_callback:0, line 201 <- wrt source file 2025-12-04T10:03:49.6566514Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.add_done_callback:0 2025-12-04T10:03:49.6567352Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_result:0, line 235 <- wrt source file 2025-12-04T10:03:49.6568188Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_result:0 2025-12-04T10:03:49.6569062Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_exception:0, line 265 <- wrt source file 2025-12-04T10:03:49.6569955Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::Future.set_exception:0 2025-12-04T10:03:49.6570817Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::collect_all:0, line 299 <- wrt source file 2025-12-04T10:03:49.6571634Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/futures/__init__.py::collect_all:0 2025-12-04T10:03:49.6572391Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/__init__.py::annotate:0, line 147 <- wrt source file 2025-12-04T10:03:49.6573146Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/__init__.py::annotate:0 2025-12-04T10:03:49.6573962Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/monitor/__init__.py::TensorboardEventHandler:0, line 22 <- wrt source file 2025-12-04T10:03:49.6579598Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/monitor/__init__.py::TensorboardEventHandler:0 2025-12-04T10:03:49.6580486Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/mps/__init__.py::compile_shader:0, line 148 <- wrt source file 2025-12-04T10:03:49.6581422Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/mps/__init__.py::compile_shader:0 2025-12-04T10:03:49.6582454Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::as_nested_tensor:0, line 61 <- wrt source file 2025-12-04T10:03:49.6602702Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::as_nested_tensor:0 2025-12-04T10:03:49.6603792Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor:0, line 240 <- wrt source file 2025-12-04T10:03:49.6608320Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor:0 2025-12-04T10:03:49.6609340Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::narrow:0, line 315 <- wrt source file 2025-12-04T10:03:49.6655823Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::narrow:0 2025-12-04T10:03:49.6656926Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor_from_jagged:0, line 405 <- wrt source file 2025-12-04T10:03:49.6663064Z W1204 10:03:49.665000 1844 site-packages/torch/fx/_symbolic_trace.py:53] is_fx_tracing will return true for both fx.symbolic_trace and torch.export. Please use is_fx_tracing_symbolic_tracing() for specifically fx.symbolic_trace or torch.compiler.is_compiling() for specifically torch.export/compile. 
2025-12-04T10:03:49.6680936Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::nested_tensor_from_jagged:0 2025-12-04T10:03:49.6682065Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::masked_select:0, line 481 <- wrt source file 2025-12-04T10:03:49.6699783Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py::masked_select:0 2025-12-04T10:03:49.6700814Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::sum:0, line 223 <- wrt source file 2025-12-04T10:03:49.6711805Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::sum:0 2025-12-04T10:03:49.6712990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::check_sparse_tensor_invariants:0, line 475 <- wrt source file 2025-12-04T10:03:49.6721239Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::check_sparse_tensor_invariants:0 2025-12-04T10:03:49.6722398Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::as_sparse_gradcheck:0, line 561 <- wrt source file 2025-12-04T10:03:49.6770652Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/__init__.py::as_sparse_gradcheck:0 2025-12-04T10:03:49.6771806Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/decorators.py::substitute_in_graph:0, line 361 <- wrt source file 2025-12-04T10:03:49.6774918Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/decorators.py::substitute_in_graph:0 2025-12-04T10:03:49.6776151Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py::VariableTracker.python_type:0, line 328 <- wrt source file 2025-12-04T10:03:49.6777445Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/base.py::VariableTracker.python_type:0 2025-12-04T10:03:49.6778958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py::speculate_subgraph_with_auto_output_flattening:0, line 1316 <- wrt source file 2025-12-04T10:03:49.6780496Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py::speculate_subgraph_with_auto_output_flattening:0 2025-12-04T10:03:49.6781847Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/utils.py::register_module_as_pytree_input_node:0, line 1441 <- wrt source file 2025-12-04T10:03:49.6783095Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/utils.py::register_module_as_pytree_input_node:0 2025-12-04T10:03:49.6784501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::mark_subclass_constructor_exportable_experimental:0, line 194 <- wrt source file 2025-12-04T10:03:49.6785892Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::mark_subclass_constructor_exportable_experimental:0 2025-12-04T10:03:49.6787144Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::allow_in_pre_dispatch_graph:0, line 262 <- wrt source file 2025-12-04T10:03:49.6788473Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_export/wrappers.py::allow_in_pre_dispatch_graph:0 2025-12-04T10:03:49.6789609Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py::aot_function:0, line 771 <- wrt source file 2025-12-04T10:03:49.7071864Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/aot_autograd.py::aot_function:0 2025-12-04T10:03:49.7073093Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/benchmark_utils.py::benchmark_utilization:0, line 184 <- wrt source file 2025-12-04T10:03:49.7074314Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/benchmark_utils.py::benchmark_utilization:0 2025-12-04T10:03:49.7075432Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::vjp:0, line 234 <- wrt source file 2025-12-04T10:03:49.7111817Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::vjp:0 2025-12-04T10:03:49.7112873Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacrev:0, line 476 <- wrt source file 2025-12-04T10:03:49.7172759Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacrev:0 2025-12-04T10:03:49.7173817Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jvp:0, line 1024 <- wrt source file 2025-12-04T10:03:49.7943451Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jvp:0 2025-12-04T10:03:49.7944558Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacfwd:0, line 1182 <- wrt source file 2025-12-04T10:03:49.8007380Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::jacfwd:0 2025-12-04T10:03:49.8008467Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::hessian:0, line 1342 <- wrt source file 2025-12-04T10:03:49.8026398Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::hessian:0 2025-12-04T10:03:49.8027636Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::functionalize:0, line 1506 
<- wrt source file 2025-12-04T10:03:49.8031254Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::functionalize:0 2025-12-04T10:03:49.8032331Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::linearize:0, line 1705 <- wrt source file 2025-12-04T10:03:49.8189271Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/eager_transforms.py::linearize:0 2025-12-04T10:03:49.8190416Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/functional_call.py::functional_call:0, line 36 <- wrt source file 2025-12-04T10:03:49.8194032Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/functional_call.py::functional_call:0 2025-12-04T10:03:49.8195301Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/fx_minifier.py::minifier:0, line 194 <- wrt source file 2025-12-04T10:03:49.8196753Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/fx_minifier.py::minifier:0 2025-12-04T10:03:49.8198345Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::CompilerWrapper.post_compile:0, line 1111 <- wrt source file 2025-12-04T10:03:49.8199679Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::CompilerWrapper.post_compile:0 2025-12-04T10:03:49.8201105Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::InductorWrapper.post_compile:0, line 1166 <- wrt source file 2025-12-04T10:03:49.8202710Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/_aot_autograd/schemas.py::InductorWrapper.post_compile:0 2025-12-04T10:03:49.8203840Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::associative_scan:0, line 183 <- wrt source file 2025-12-04T10:03:49.8205103Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::associative_scan:0 2025-12-04T10:03:49.8206480Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::generic_associative_scan:0, line 319 <- wrt source file 2025-12-04T10:03:49.8207765Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/associative_scan.py::generic_associative_scan:0 2025-12-04T10:03:49.8208732Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/cond.py::cond:0, line 139 <- wrt source file 2025-12-04T10:03:49.8209762Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/cond.py::cond:0 2025-12-04T10:03:49.8210816Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/flat_apply.py::FlatApply.__call__:0, line 80 <- wrt source file 2025-12-04T10:03:49.8211933Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/flat_apply.py::FlatApply.__call__:0 2025-12-04T10:03:49.8213067Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/map.py::map:0, line 80 <- wrt source file 2025-12-04T10:03:49.8213947Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/map.py::map:0 2025-12-04T10:03:49.8214891Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/partitioner.py::HopPartitionedGraph._reorder_fw_output:0, line 133 <- wrt source file 2025-12-04T10:03:49.8215989Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/partitioner.py::HopPartitionedGraph._reorder_fw_output:0 2025-12-04T10:03:49.8217003Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/scan.py::scan:0, line 130 <- wrt source file 2025-12-04T10:03:49.8217825Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_higher_order_ops/scan.py::scan:0 2025-12-04T10:03:49.8218655Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py::WritableTempFile:0, line 385 <- wrt source file 2025-12-04T10:03:49.8219582Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/codecache.py::WritableTempFile:0 2025-12-04T10:03:49.8220508Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/cpp_builder.py::get_name_and_dir_from_output_file_path:0, line 1845 <- wrt source file 2025-12-04T10:03:49.8221522Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/cpp_builder.py::get_name_and_dir_from_output_file_path:0 2025-12-04T10:03:49.8222473Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py::add_preprocessing_fn:0, line 4328 <- wrt source file 2025-12-04T10:03:49.8223414Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/select_algorithm.py::add_preprocessing_fn:0 2025-12-04T10:03:49.8224317Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/fx_passes/graph_view.py::_clean_stack_name:0, line 100 <- wrt source file 2025-12-04T10:03:49.8225332Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/fx_passes/graph_view.py::_clean_stack_name:0 2025-12-04T10:03:49.8226245Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::CustomOpConfig:0, line 56 <- wrt source file 2025-12-04T10:03:49.8227143Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::CustomOpConfig:0 2025-12-04T10:03:49.8228191Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::register_custom_op_autotuning:0, line 423 <- wrt source file 2025-12-04T10:03:49.8229191Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/kernel/custom_op.py::register_custom_op_autotuning:0 2025-12-04T10:03:49.8230173Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_lock_with_timeout:0, line 69 <- wrt source file 2025-12-04T10:03:49.8231175Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_lock_with_timeout:0 2025-12-04T10:03:49.8232189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_lock_with_timeout:0, line 105 <- wrt source file 2025-12-04T10:03:49.8233233Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_lock_with_timeout:0 2025-12-04T10:03:49.8234243Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_flock_with_timeout:0, line 142 <- wrt source file 2025-12-04T10:03:49.8235254Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_acquire_flock_with_timeout:0 2025-12-04T10:03:49.8236269Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_flock_with_timeout:0, line 179 <- wrt source file 2025-12-04T10:03:49.8237375Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/runtime/caching/locks.py::_unsafe_acquire_flock_with_timeout:0 2025-12-04T10:03:49.8238427Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/template_heuristics/registry.py::register_template_heuristic:0, line 54 <- wrt source file 
2025-12-04T10:03:49.8239491Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_inductor/template_heuristics/registry.py::register_template_heuristic:0 2025-12-04T10:03:49.8240391Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::custom_op:0, line 101 <- wrt source file 2025-12-04T10:03:49.8520000Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::custom_op:0 2025-12-04T10:03:49.8521134Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.set_kernel_enabled:0, line 241 <- wrt source file 2025-12-04T10:03:49.8599702Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.set_kernel_enabled:0 2025-12-04T10:03:49.8600941Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_kernel:0, line 310 <- wrt source file 2025-12-04T10:03:49.8602156Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_kernel:0 2025-12-04T10:03:49.8603430Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autograd:0, line 549 <- wrt source file 2025-12-04T10:03:49.8752828Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autograd:0 2025-12-04T10:03:49.8754022Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_vmap:0, line 724 <- wrt source file 2025-12-04T10:03:49.8905968Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_vmap:0 2025-12-04T10:03:49.8907636Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autocast:0, line 810 <- wrt 
source file 2025-12-04T10:03:49.8909066Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py::CustomOpDef.register_autocast:0 2025-12-04T10:03:49.8910516Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_class_registry.py::register_fake_class:0, line 273 <- wrt source file 2025-12-04T10:03:49.8912023Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_class_registry.py::register_fake_class:0 2025-12-04T10:03:49.8913100Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_impl.py::FakeImplCtx.new_dynamic_size:0, line 175 <- wrt source file 2025-12-04T10:03:49.8976761Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_impl.py::FakeImplCtx.new_dynamic_size:0 2025-12-04T10:03:49.8977907Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/infer_schema.py::infer_schema:0, line 53 <- wrt source file 2025-12-04T10:03:49.8983445Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/infer_schema.py::infer_schema:0 2025-12-04T10:03:49.8984665Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::triton_op:0, line 136 <- wrt source file 2025-12-04T10:03:49.8985970Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::triton_op:0 2025-12-04T10:03:49.8987419Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::wrap_triton:0, line 307 <- wrt source file 2025-12-04T10:03:49.8988509Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/triton.py::wrap_triton:0 2025-12-04T10:03:49.8989492Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_logging/_internal.py::set_logs:0, line 460 <- wrt source file 2025-12-04T10:03:49.8990504Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_logging/_internal.py::set_logs:0 2025-12-04T10:03:49.8991627Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_equal:0, line 171 <- wrt source file 2025-12-04T10:03:49.9024162Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_equal:0 2025-12-04T10:03:49.9025300Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::print_assert_equal:0, line 302 <- wrt source file 2025-12-04T10:03:49.9026428Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::print_assert_equal:0 2025-12-04T10:03:49.9027600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_almost_equal:0, line 375 <- wrt source file 2025-12-04T10:03:49.9069796Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_almost_equal:0 2025-12-04T10:03:49.9071112Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_approx_equal:0, line 496 <- wrt source file 2025-12-04T10:03:49.9073398Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_approx_equal:0 2025-12-04T10:03:49.9074501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_equal:0, line 793 <- wrt source file 2025-12-04T10:03:49.9133636Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_equal:0 2025-12-04T10:03:49.9134755Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal:0, line 899 <- wrt source file 2025-12-04T10:03:49.9193629Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal:0 2025-12-04T10:03:49.9194771Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_less:0, line 1008 <- wrt source file 2025-12-04T10:03:49.9246016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_less:0 2025-12-04T10:03:49.9247289Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_string_equal:0, line 1073 <- wrt source file 2025-12-04T10:03:49.9248716Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_string_equal:0 2025-12-04T10:03:49.9249958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_allclose:0, line 1294 <- wrt source file 2025-12-04T10:03:49.9265280Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_allclose:0 2025-12-04T10:03:49.9266346Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal_nulp:0, line 1360 <- wrt source file 2025-12-04T10:03:49.9269775Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_almost_equal_nulp:0 2025-12-04T10:03:49.9270881Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_max_ulp:0, line 1423 <- wrt source file 2025-12-04T10:03:49.9274144Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_array_max_ulp:0 2025-12-04T10:03:49.9275197Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::nulp_diff:0, line 1468 <- wrt source file 2025-12-04T10:03:49.9276386Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::nulp_diff:0 2025-12-04T10:03:49.9277371Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_warns:0, line 1578 <- wrt source file 2025-12-04T10:03:49.9280247Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::assert_warns:0 2025-12-04T10:03:49.9281260Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::clear_and_catch_warnings:0, line 1881 <- wrt source file 2025-12-04T10:03:49.9283210Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_numpy/testing/utils.py::clear_and_catch_warnings:0 2025-12-04T10:03:49.9284342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims/context.py::TorchRefsMode:0, line 95 <- wrt source file 2025-12-04T10:03:49.9285613Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_prims/context.py::TorchRefsMode:0 2025-12-04T10:03:49.9286671Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/complex_tensor/_ops/common.py::is_complex_tensor:0, line 47 <- wrt source file 2025-12-04T10:03:49.9287965Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_subclasses/complex_tensor/_ops/common.py::is_complex_tensor:0 2025-12-04T10:03:49.9289039Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/amp/grad_scaler.py::GradScaler:0, line 64 <- wrt source file 2025-12-04T10:03:49.9290026Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/amp/grad_scaler.py::GradScaler:0 2025-12-04T10:03:49.9290936Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/qat/modules/linear_relu.py::LinearReLU:0, line 34 <- wrt source file 2025-12-04T10:03:49.9291981Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/qat/modules/linear_relu.py::LinearReLU:0 2025-12-04T10:03:49.9293022Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py::LinearReLU:0, line 24 <- wrt source file 2025-12-04T10:03:49.9294136Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/dynamic/modules/linear_relu.py::LinearReLU:0 2025-12-04T10:03:49.9295187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearReLU:0, line 25 <- wrt source file 2025-12-04T10:03:49.9296219Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearReLU:0 2025-12-04T10:03:49.9297262Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearLeakyReLU:0, line 67 <- wrt source file 2025-12-04T10:03:49.9298335Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearLeakyReLU:0 2025-12-04T10:03:49.9299487Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearTanh:0, line 142 <- wrt source file 2025-12-04T10:03:49.9300904Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/intrinsic/quantized/modules/linear_relu.py::LinearTanh:0 2025-12-04T10:03:49.9302153Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTMCell:0, line 29 <- wrt source file 2025-12-04T10:03:49.9310996Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTMCell:0 2025-12-04T10:03:49.9312170Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTM:0, line 413 <- wrt source file 2025-12-04T10:03:49.9340082Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantizable/modules/rnn.py::LSTM:0 2025-12-04T10:03:49.9341241Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv1d:0, line 210 <- wrt source file 2025-12-04T10:03:49.9342379Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv1d:0 2025-12-04T10:03:49.9343473Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv2d:0, line 282 <- wrt source file 2025-12-04T10:03:49.9344678Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv2d:0 2025-12-04T10:03:49.9345826Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv3d:0, line 358 <- wrt source file 2025-12-04T10:03:49.9346956Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/functional.py::conv3d:0 2025-12-04T10:03:49.9348227Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::Quantize:0, line 95 <- wrt source file 2025-12-04T10:03:49.9350229Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::Quantize:0 2025-12-04T10:03:49.9351397Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::DeQuantize:0, line 145 <- wrt source file 2025-12-04T10:03:49.9356535Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/__init__.py::DeQuantize:0 2025-12-04T10:03:49.9357793Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv1d:0, line 
43 <- wrt source file 2025-12-04T10:03:49.9359065Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv1d:0 2025-12-04T10:03:49.9360271Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv2d:0, line 126 <- wrt source file 2025-12-04T10:03:49.9361510Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv2d:0 2025-12-04T10:03:49.9362718Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv3d:0, line 212 <- wrt source file 2025-12-04T10:03:49.9363962Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::Conv3d:0 2025-12-04T10:03:49.9365214Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose1d:0, line 300 <- wrt source file 2025-12-04T10:03:49.9366702Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose1d:0 2025-12-04T10:03:49.9368020Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose2d:0, line 383 <- wrt source file 2025-12-04T10:03:49.9369352Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose2d:0 2025-12-04T10:03:49.9370641Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose3d:0, line 466 <- wrt source file 2025-12-04T10:03:49.9372064Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/conv.py::ConvTranspose3d:0 2025-12-04T10:03:49.9373328Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py::Linear:0, line 30 <- wrt source file 2025-12-04T10:03:49.9374587Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/linear.py::Linear:0 2025-12-04T10:03:49.9375780Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTM:0, line 516 <- wrt source file 2025-12-04T10:03:49.9376986Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTM:0 2025-12-04T10:03:49.9378292Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRU:0, line 803 <- wrt source file 2025-12-04T10:03:49.9379508Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRU:0 2025-12-04T10:03:49.9380701Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::RNNCell:0, line 1209 <- wrt source file 2025-12-04T10:03:49.9381925Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::RNNCell:0 2025-12-04T10:03:49.9383129Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTMCell:0, line 1276 <- wrt source file 2025-12-04T10:03:49.9384370Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::LSTMCell:0 2025-12-04T10:03:49.9385589Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRUCell:0, line 1329 <- wrt source file 2025-12-04T10:03:49.9386819Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/dynamic/modules/rnn.py::GRUCell:0 2025-12-04T10:03:49.9388142Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/activation.py::ReLU6:0, line 36 <- wrt source file 2025-12-04T10:03:49.9389359Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/activation.py::ReLU6:0 2025-12-04T10:03:49.9390517Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv1d:0, line 376 <- wrt source file 2025-12-04T10:03:49.9391677Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv1d:0 2025-12-04T10:03:49.9392798Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv2d:0, line 506 <- wrt source file 2025-12-04T10:03:49.9394059Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv2d:0 2025-12-04T10:03:49.9395240Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv3d:0, line 636 <- wrt source file 2025-12-04T10:03:49.9396394Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::Conv3d:0 2025-12-04T10:03:49.9397555Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose1d:0, line 893 <- wrt source file 2025-12-04T10:03:49.9398800Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose1d:0 2025-12-04T10:03:49.9400076Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose2d:0, line 1015 <- wrt source file 2025-12-04T10:03:49.9401317Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose2d:0 2025-12-04T10:03:49.9402538Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose3d:0, line 1141 <- wrt source file 2025-12-04T10:03:49.9403770Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/conv.py::ConvTranspose3d:0 2025-12-04T10:03:49.9404990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::Embedding:0, line 111 <- wrt source file 2025-12-04T10:03:49.9406363Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::Embedding:0 2025-12-04T10:03:49.9407622Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::EmbeddingBag:0, line 275 <- wrt source file 2025-12-04T10:03:49.9408921Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/embedding_ops.py::EmbeddingBag:0 2025-12-04T10:03:49.9410230Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::FloatFunctional:0, line 23 <- wrt source file 2025-12-04T10:03:49.9411624Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::FloatFunctional:0 2025-12-04T10:03:49.9412953Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::QFunctional:0, line 176 <- wrt source file 2025-12-04T10:03:49.9414304Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/functional_modules.py::QFunctional:0 2025-12-04T10:03:49.9415519Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/linear.py::Linear:0, line 135 <- wrt source file 2025-12-04T10:03:49.9416696Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/linear.py::Linear:0 
2025-12-04T10:03:49.9417801Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/rnn.py::LSTM:0, line 24 <- wrt source file 2025-12-04T10:03:49.9418914Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/nn/quantized/modules/rnn.py::LSTM:0 2025-12-04T10:03:49.9420310Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_scheduler/base_data_scheduler.py::BaseDataScheduler.get_schedule_param:0, line 98 <- wrt source file 2025-12-04T10:03:49.9452210Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_scheduler/base_data_scheduler.py::BaseDataScheduler.get_schedule_param:0 2025-12-04T10:03:49.9453902Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_sparsifier/base_data_sparsifier.py::BaseDataSparsifier:0, line 55 <- wrt source file 2025-12-04T10:03:49.9455720Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/data_sparsifier/base_data_sparsifier.py::BaseDataSparsifier:0 2025-12-04T10:03:49.9457137Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/scheduler/lambda_scheduler.py::LambdaSL:0, line 24 <- wrt source file 2025-12-04T10:03:49.9458486Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/scheduler/lambda_scheduler.py::LambdaSL:0 2025-12-04T10:03:49.9459476Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier:0, line 47 <- wrt source file 2025-12-04T10:03:49.9460488Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier:0 2025-12-04T10:03:49.9461543Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier.squash_mask:0, line 251 
<- wrt source file 2025-12-04T10:03:49.9462642Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/sparsifier/base_sparsifier.py::BaseSparsifier.squash_mask:0 2025-12-04T10:03:49.9463690Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuse_modules.py::fuse_modules:0, line 175 <- wrt source file 2025-12-04T10:03:49.9464661Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuse_modules.py::fuse_modules:0 2025-12-04T10:03:49.9465584Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn:0, line 32 <- wrt source file 2025-12-04T10:03:49.9474108Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn:0 2025-12-04T10:03:49.9475399Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn_relu:0, line 83 <- wrt source file 2025-12-04T10:03:49.9480923Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_conv_bn_relu:0 2025-12-04T10:03:49.9482187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_linear_bn:0, line 143 <- wrt source file 2025-12-04T10:03:49.9486740Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_linear_bn:0 2025-12-04T10:03:49.9488080Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_convtranspose_bn:0, line 182 <- wrt source file 2025-12-04T10:03:49.9492635Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fuser_method_mappings.py::fuse_convtranspose_bn:0 2025-12-04T10:03:49.9493594Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_args:0, line 110 <- wrt source file 2025-12-04T10:03:49.9494481Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_args:0 2025-12-04T10:03:49.9495366Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_callable_args:0, line 132 <- wrt source file 2025-12-04T10:03:49.9496440Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/observer.py::_with_callable_args:0 2025-12-04T10:03:49.9497323Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::fuse_fx:0, line 218 <- wrt source file 2025-12-04T10:03:49.9498185Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::fuse_fx:0 2025-12-04T10:03:49.9499520Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_fx:0, line 288 <- wrt source file 2025-12-04T10:03:49.9501076Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_fx:0 2025-12-04T10:03:49.9502523Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_qat_fx:0, line 427 <- wrt source file 2025-12-04T10:03:49.9503811Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::prepare_qat_fx:0 2025-12-04T10:03:49.9504944Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_fx:0, line 608 <- wrt source file 2025-12-04T10:03:49.9506150Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_fx:0 2025-12-04T10:03:49.9507421Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_to_reference_fx:0, line 668 <- wrt source file 2025-12-04T10:03:49.9508991Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::convert_to_reference_fx:0 2025-12-04T10:03:49.9510349Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::_convert_to_reference_decomposed_fx:0, line 720 <- wrt source file 2025-12-04T10:03:49.9511444Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_fx.py::_convert_to_reference_decomposed_fx:0 2025-12-04T10:03:49.9512433Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_pt2e:0, line 51 <- wrt source file 2025-12-04T10:03:49.9513354Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_pt2e:0 2025-12-04T10:03:49.9514277Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_qat_pt2e:0, line 130 <- wrt source file 2025-12-04T10:03:49.9515644Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::prepare_qat_pt2e:0 2025-12-04T10:03:49.9516970Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::convert_pt2e:0, line 228 <- wrt source file 2025-12-04T10:03:49.9518060Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/quantize_pt2e.py::convert_pt2e:0 2025-12-04T10:03:49.9519258Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::get_combined_dict:0, line 171 <- wrt source file 2025-12-04T10:03:49.9520352Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::get_combined_dict:0 
2025-12-04T10:03:49.9521246Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_path_of_module:0, line 553 <- wrt source file 2025-12-04T10:03:49.9522233Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_path_of_module:0 2025-12-04T10:03:49.9523724Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_signature_locals:0, line 575 <- wrt source file 2025-12-04T10:03:49.9524740Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_signature_locals:0 2025-12-04T10:03:49.9525642Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_default_kwargs:0, line 589 <- wrt source file 2025-12-04T10:03:49.9526559Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_default_kwargs:0 2025-12-04T10:03:49.9527636Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_normalize_kwargs:0, line 611 <- wrt source file 2025-12-04T10:03:49.9528866Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_normalize_kwargs:0 2025-12-04T10:03:49.9530155Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_num_pos_args:0, line 738 <- wrt source file 2025-12-04T10:03:49.9531223Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/utils.py::_get_num_pos_args:0 2025-12-04T10:03:49.9532693Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/backend_config.py::DTypeConfig:0, line 216 <- wrt source file 2025-12-04T10:03:49.9534217Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/backend_config.py::DTypeConfig:0 
2025-12-04T10:03:49.9535769Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/onednn.py::_fuse_linear_bn_leaky_relu:0, line 85 <- wrt source file 2025-12-04T10:03:49.9537094Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/backend_config/onednn.py::_fuse_linear_bn_leaky_relu:0 2025-12-04T10:03:49.9538136Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report.py::ModelReport:0, line 85 <- wrt source file 2025-12-04T10:03:49.9539495Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report.py::ModelReport:0 2025-12-04T10:03:49.9540959Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_filtered_tables:0, line 341 <- wrt source file 2025-12-04T10:03:49.9542771Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_filtered_tables:0 2025-12-04T10:03:49.9544516Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_table_visualization:0, line 429 <- wrt source file 2025-12-04T10:03:49.9545926Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_table_visualization:0 2025-12-04T10:03:49.9547365Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_plot_visualization:0, line 591 <- wrt source file 2025-12-04T10:03:49.9548741Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_plot_visualization:0 2025-12-04T10:03:49.9550254Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_histogram_visualization:0, line 664 <- wrt source file 2025-12-04T10:03:49.9552138Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/fx/_model_report/model_report_visualizer.py::ModelReportVisualizer.generate_histogram_visualization:0 2025-12-04T10:03:49.9553376Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_get_reduction_params:0, line 104 <- wrt source file 2025-12-04T10:03:49.9554541Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_get_reduction_params:0 2025-12-04T10:03:49.9555872Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_register_custom_op:0, line 155 <- wrt source file 2025-12-04T10:03:49.9556969Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/_affine_quantization.py::_register_custom_op:0 2025-12-04T10:03:49.9557990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/prepare.py::_get_edge_or_node_to_group_id:0, line 189 <- wrt source file 2025-12-04T10:03:49.9559075Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/prepare.py::_get_edge_or_node_to_group_id:0 2025-12-04T10:03:49.9560297Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/utils.py::_replace_literals_with_new_placeholders:0, line 442 <- wrt source file 2025-12-04T10:03:49.9561473Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/quantization/pt2e/utils.py::_replace_literals_with_new_placeholders:0 2025-12-04T10:03:49.9562445Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/anomaly_mode.py::detect_anomaly:0, line 28 <- wrt source file 2025-12-04T10:03:49.9563317Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/anomaly_mode.py::detect_anomaly:0 2025-12-04T10:03:49.9564277Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::make_dual:0, line 82 <- wrt source file 2025-12-04T10:03:49.9565098Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::make_dual:0 2025-12-04T10:03:49.9565927Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::unpack_dual:0, line 151 <- wrt source file 2025-12-04T10:03:49.9566879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::unpack_dual:0 2025-12-04T10:03:49.9567725Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::dual_level:0, line 187 <- wrt source file 2025-12-04T10:03:49.9568553Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/forward_ad.py::dual_level:0 2025-12-04T10:03:49.9569426Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_backward:0, line 72 <- wrt source file 2025-12-04T10:03:49.9570373Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_backward:0 2025-12-04T10:03:49.9571430Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_forward:0, line 116 <- wrt source file 2025-12-04T10:03:49.9572395Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.save_for_forward:0 2025-12-04T10:03:49.9573421Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_dirty:0, line 169 <- wrt source file 2025-12-04T10:03:49.9574354Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_dirty:0 2025-12-04T10:03:49.9575439Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_non_differentiable:0, line 216 <- wrt source file 2025-12-04T10:03:49.9576457Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.mark_non_differentiable:0 2025-12-04T10:03:49.9577553Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.set_materialize_grads:0, line 245 <- wrt source file 2025-12-04T10:03:49.9578690Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::FunctionCtx.set_materialize_grads:0 2025-12-04T10:03:49.9579566Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::Function:0, line 487 <- wrt source file 2025-12-04T10:03:49.9580373Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/function.py::Function:0 2025-12-04T10:03:49.9581271Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vjp:0, line 300 <- wrt source file 2025-12-04T10:03:49.9582169Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vjp:0 2025-12-04T10:03:49.9583103Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jvp:0, line 402 <- wrt source file 2025-12-04T10:03:49.9583920Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jvp:0 
2025-12-04T10:03:49.9584720Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jacobian:0, line 642 <- wrt source file 2025-12-04T10:03:49.9585546Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::jacobian:0 2025-12-04T10:03:49.9586462Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hessian:0, line 907 <- wrt source file 2025-12-04T10:03:49.9587385Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hessian:0 2025-12-04T10:03:49.9588204Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vhp:0, line 1026 <- wrt source file 2025-12-04T10:03:49.9588998Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::vhp:0 2025-12-04T10:03:49.9589765Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hvp:0, line 1125 <- wrt source file 2025-12-04T10:03:49.9590554Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/functional.py::hvp:0 2025-12-04T10:03:49.9591329Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::no_grad:0, line 50 <- wrt source file 2025-12-04T10:03:49.9592261Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::no_grad:0 2025-12-04T10:03:49.9593078Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::enable_grad:0, line 108 <- wrt source file 2025-12-04T10:03:49.9593908Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::enable_grad:0 2025-12-04T10:03:49.9594808Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::set_grad_enabled:0, line 166 <- wrt source file 2025-12-04T10:03:49.9595699Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::set_grad_enabled:0 2025-12-04T10:03:49.9596551Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::inference_mode:0, line 252 <- wrt source file 2025-12-04T10:03:49.9597389Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/grad_mode.py::inference_mode:0 2025-12-04T10:03:49.9598240Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.name:0, line 60 <- wrt source file 2025-12-04T10:03:49.9599026Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.name:0 2025-12-04T10:03:49.9599825Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_hook:0, line 117 <- wrt source file 2025-12-04T10:03:49.9609292Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_hook:0 2025-12-04T10:03:49.9610166Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_prehook:0, line 154 <- wrt source file 2025-12-04T10:03:49.9624875Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::Node.register_prehook:0 2025-12-04T10:03:49.9626565Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::saved_tensors_hooks:0, line 292 <- wrt source file 2025-12-04T10:03:49.9628017Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::saved_tensors_hooks:0 2025-12-04T10:03:49.9629059Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::save_on_cpu:0, line 362 <- wrt source file 2025-12-04T10:03:49.9630056Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::save_on_cpu:0 2025-12-04T10:03:49.9631111Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::disable_saved_tensors_hooks:0, line 419 <- wrt source file 2025-12-04T10:03:49.9632191Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::disable_saved_tensors_hooks:0 2025-12-04T10:03:49.9633077Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::register_multi_grad_hook:0, line 503 <- wrt source file 2025-12-04T10:03:49.9640859Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::register_multi_grad_hook:0 2025-12-04T10:03:49.9641848Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::allow_mutation_on_saved_tensors:0, line 777 <- wrt source file 2025-12-04T10:03:49.9658499Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/graph.py::allow_mutation_on_saved_tensors:0 2025-12-04T10:03:49.9659774Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::profile:0, line 182 <- wrt source file 2025-12-04T10:03:49.9661046Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::profile:0 2025-12-04T10:03:49.9662188Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::record_function:0, line 760 <- wrt source file 2025-12-04T10:03:49.9663185Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::record_function:0 2025-12-04T10:03:49.9664374Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_itt:0, line 899 <- wrt source file 2025-12-04T10:03:49.9665375Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_itt:0 2025-12-04T10:03:49.9666279Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_nvtx:0, line 972 <- wrt source file 
2025-12-04T10:03:49.9667398Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler.py::emit_nvtx:0 2025-12-04T10:03:49.9668595Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler_util.py::EventList:0, line 60 <- wrt source file 2025-12-04T10:03:49.9669679Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/autograd/profiler_util.py::EventList:0 2025-12-04T10:03:49.9670681Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_register_buffer:0, line 43 <- wrt source file 2025-12-04T10:03:49.9671607Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_register_buffer:0 2025-12-04T10:03:49.9672591Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_deregister_buffer:0, line 59 <- wrt source file 2025-12-04T10:03:49.9673629Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::gds_deregister_buffer:0 2025-12-04T10:03:49.9674654Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::GdsFile:0, line 86 <- wrt source file 2025-12-04T10:03:49.9675704Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/gds.py::GdsFile:0 2025-12-04T10:03:49.9676582Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:0, line 114 <- wrt source file 2025-12-04T10:03:49.9677643Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:0 2025-12-04T10:03:49.9678591Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:1, line 125 <- wrt source file 2025-12-04T10:03:49.9679567Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:1 2025-12-04T10:03:49.9680585Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:2, line 140 <- wrt source file 2025-12-04T10:03:49.9681454Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_jit_fn:2 2025-12-04T10:03:49.9682368Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_multi_output_jit_fn:0, line 173 <- wrt source file 2025-12-04T10:03:49.9683552Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/jiterator.py::_create_multi_output_jit_fn:0 2025-12-04T10:03:49.9684648Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/profiler.py::profile:0, line 75 <- wrt source file 2025-12-04T10:03:49.9685686Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/profiler.py::profile:0 2025-12-04T10:03:49.9686806Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.composition:0, line 125 <- wrt source file 2025-12-04T10:03:49.9687920Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.composition:0 2025-12-04T10:03:49.9689000Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.complement:0, line 142 <- wrt source file 2025-12-04T10:03:49.9690029Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.complement:0 2025-12-04T10:03:49.9691004Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.remap_to_tensor:0, line 281 <- wrt source file 2025-12-04T10:03:49.9692064Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_mesh_layout.py::_MeshLayout.remap_to_tensor:0 2025-12-04T10:03:49.9693114Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh:0, line 167 <- wrt source file 2025-12-04T10:03:49.9693998Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh:0 2025-12-04T10:03:49.9695135Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh.get_local_rank:0, line 1027 <- wrt source file 2025-12-04T10:03:49.9696254Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::DeviceMesh.get_local_rank:0 2025-12-04T10:03:49.9697301Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::init_device_mesh:0, line 1317 <- wrt source file 2025-12-04T10:03:49.9698231Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py::init_device_mesh:0 2025-12-04T10:03:49.9699456Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_coalescing_manager:0, line 2652 <- wrt source file 2025-12-04T10:03:49.9700454Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_coalescing_manager:0 2025-12-04T10:03:49.9701381Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_time_estimator:0, line 2754 <- wrt source file 2025-12-04T10:03:49.9702328Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::_time_estimator:0 2025-12-04T10:03:49.9703254Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::batch_isend_irecv:0, line 2801 <- wrt source file 2025-12-04T10:03:49.9704215Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::batch_isend_irecv:0 2025-12-04T10:03:49.9705124Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_reduce:0, line 2938 <- wrt source file 2025-12-04T10:03:49.9706028Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_reduce:0 2025-12-04T10:03:49.9706925Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_object:0, line 3221 <- wrt source file 2025-12-04T10:03:49.9707979Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_object:0 2025-12-04T10:03:49.9708911Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather_object:0, line 3325 <- wrt source file 2025-12-04T10:03:49.9709834Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather_object:0 2025-12-04T10:03:49.9710736Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::send_object_list:0, line 3457 <- wrt source file 2025-12-04T10:03:49.9711720Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::send_object_list:0 2025-12-04T10:03:49.9712804Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::recv_object_list:0, line 3574 <- wrt source file 2025-12-04T10:03:49.9713768Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::recv_object_list:0 2025-12-04T10:03:49.9714705Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::broadcast_object_list:0, line 3719 <- wrt source file 2025-12-04T10:03:49.9715744Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::broadcast_object_list:0 2025-12-04T10:03:49.9716691Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter_object_list:0, line 3844 <- wrt source file 2025-12-04T10:03:49.9717665Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter_object_list:0 2025-12-04T10:03:49.9718841Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather:0, line 3947 <- wrt source file 2025-12-04T10:03:49.9719769Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather:0 2025-12-04T10:03:49.9720879Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_into_tensor:0, line 4054 <- wrt source file 2025-12-04T10:03:49.9722077Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_into_tensor:0 2025-12-04T10:03:49.9723042Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_coalesced:0, line 4192 <- wrt source file 2025-12-04T10:03:49.9724128Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_gather_coalesced:0 2025-12-04T10:03:49.9725177Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather:0, line 4298 <- wrt source file 2025-12-04T10:03:49.9726269Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::gather:0 2025-12-04T10:03:49.9727274Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter:0, line 4383 <- wrt source file 2025-12-04T10:03:49.9728165Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::scatter:0 2025-12-04T10:03:49.9729077Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::reduce_scatter_tensor:0, line 4521 <- wrt source file 2025-12-04T10:03:49.9730056Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::reduce_scatter_tensor:0 2025-12-04T10:03:49.9730995Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all_single:0, line 4663 <- wrt source file 2025-12-04T10:03:49.9731953Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all_single:0 2025-12-04T10:03:49.9732860Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all:0, line 4797 <- wrt source file 2025-12-04T10:03:49.9733746Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::all_to_all:0 2025-12-04T10:03:49.9734720Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::monitored_barrier:0, line 5009 <- wrt source file 2025-12-04T10:03:49.9735689Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::monitored_barrier:0 2025-12-04T10:03:49.9736601Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups:0, line 5562 <- wrt source file 2025-12-04T10:03:49.9737520Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups:0 2025-12-04T10:03:49.9738538Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups_by_enumeration:0, line 5656 <- wrt source file 2025-12-04T10:03:49.9739572Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py::new_subgroups_by_enumeration:0 
2025-12-04T10:03:49.9740460Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launch.py::__doc__:0, line 84 <- wrt source file 2025-12-04T10:03:49.9741253Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/launch.py::__doc__:0 2025-12-04T10:03:49.9742020Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py::__doc__:0, line 57 <- wrt source file 2025-12-04T10:03:49.9742888Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/run.py::__doc__:0 2025-12-04T10:03:49.9743706Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/autograd/__init__.py::context:0, line 47 <- wrt source file 2025-12-04T10:03:49.9744585Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/autograd/__init__.py::context:0 2025-12-04T10:03:49.9745510Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/checkpoint_activation.py::checkpoint:0, line 53 <- wrt source file 2025-12-04T10:03:49.9746529Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/checkpoint_activation.py::checkpoint:0 2025-12-04T10:03:49.9747587Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/contract.py::contract:0, line 67 <- wrt source file 2025-12-04T10:03:49.9748511Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/contract.py::contract:0 2025-12-04T10:03:49.9749403Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate.py::replicate:0, line 190 <- wrt source file 2025-12-04T10:03:49.9754127Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate.py::replicate:0 2025-12-04T10:03:49.9755502Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate_with_fsdp.py::replicate:0, line 265 <- wrt source file 2025-12-04T10:03:49.9756692Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_composable/replicate_with_fsdp.py::replicate:0 2025-12-04T10:03:49.9757750Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_optim/__init__.py::named_params_with_sharded_tensor:0, line 31 <- wrt source file 2025-12-04T10:03:49.9758997Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_optim/__init__.py::named_params_with_sharded_tensor:0 2025-12-04T10:03:49.9760170Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::init_from_local_shards:0, line 384 <- wrt source file 2025-12-04T10:03:49.9761267Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::init_from_local_shards:0 2025-12-04T10:03:49.9762301Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::custom_sharded_op_impl:0, line 457 <- wrt source file 2025-12-04T10:03:49.9763647Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/__init__.py::custom_sharded_op_impl:0 2025-12-04T10:03:49.9765110Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor._init_from_local_tensor:0, line 860 <- wrt source file 2025-12-04T10:03:49.9766485Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor._init_from_local_tensor:0 2025-12-04T10:03:49.9767850Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor.reshard:0, 
line 1098 <- wrt source file 2025-12-04T10:03:49.9769138Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/api.py::ShardedTensor.reshard:0 2025-12-04T10:03:49.9770544Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/_ops/_common.py::_sharded_op_common:0, line 18 <- wrt source file 2025-12-04T10:03:49.9771939Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharded_tensor/_ops/_common.py::_sharded_op_common:0 2025-12-04T10:03:49.9773170Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharding_plan/api.py::ShardingPlan:0, line 36 <- wrt source file 2025-12-04T10:03:49.9774352Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_shard/sharding_plan/api.py::ShardingPlan:0 2025-12-04T10:03:49.9775517Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::put:0, line 275 <- wrt source file 2025-12-04T10:03:49.9776813Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::put:0 2025-12-04T10:03:49.9778191Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get:0, line 328 <- wrt source file 2025-12-04T10:03:49.9779454Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get:0 2025-12-04T10:03:49.9780632Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get_nbi:0, line 378 <- wrt source file 2025-12-04T10:03:49.9781937Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::get_nbi:0 2025-12-04T10:03:49.9783179Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::putmem_signal_block:0, line 453 <- wrt source file 2025-12-04T10:03:49.9784268Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::putmem_signal_block:0 2025-12-04T10:03:49.9785501Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::wait_until:0, line 531 <- wrt source file 2025-12-04T10:03:49.9786618Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::wait_until:0 2025-12-04T10:03:49.9787754Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_wait_until:0, line 593 <- wrt source file 2025-12-04T10:03:49.9789040Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_wait_until:0 2025-12-04T10:03:49.9790462Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_op:0, line 651 <- wrt source file 2025-12-04T10:03:49.9792007Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::signal_op:0 2025-12-04T10:03:49.9793407Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::fence:0, line 704 <- wrt source file 2025-12-04T10:03:49.9794887Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::fence:0 2025-12-04T10:03:49.9796261Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::quiet:0, line 750 <- wrt source file 2025-12-04T10:03:49.9797676Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::quiet:0 2025-12-04T10:03:49.9799278Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::my_pe:0, line 794 <- wrt source file 2025-12-04T10:03:49.9800694Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::my_pe:0 2025-12-04T10:03:49.9801703Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::n_pes:0, line 837 <- wrt source file 2025-12-04T10:03:49.9802685Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::n_pes:0 2025-12-04T10:03:49.9803947Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::barrier_all:0, line 888 <- wrt source file 2025-12-04T10:03:49.9805281Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::barrier_all:0 2025-12-04T10:03:49.9806489Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::sync_all:0, line 934 <- wrt source file 2025-12-04T10:03:49.9807692Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::sync_all:0 2025-12-04T10:03:49.9808704Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::alltoall:0, line 973 <- wrt source file 2025-12-04T10:03:49.9809927Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::alltoall:0 2025-12-04T10:03:49.9810931Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::broadcast:0, line 1028 <- wrt source file 2025-12-04T10:03:49.9812514Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::broadcast:0 2025-12-04T10:03:49.9813961Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce:0, line 1089 <- wrt source file 2025-12-04T10:03:49.9815531Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce:0 2025-12-04T10:03:49.9817093Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce_extern_wrapper:0, line 1135 <- wrt source file 2025-12-04T10:03:49.9818657Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_symmetric_memory/_nvshmem_triton.py::reduce_extern_wrapper:0 2025-12-04T10:03:49.9820068Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_tools/memory_tracker.py::MemoryTracker:0, line 55 <- wrt source file 2025-12-04T10:03:49.9821051Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/_tools/memory_tracker.py::MemoryTracker:0 2025-12-04T10:03:49.9821948Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/join.py::Join:0, line 141 <- wrt source file 2025-12-04T10:03:49.9822834Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/join.py::Join:0 2025-12-04T10:03:49.9823807Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py::register_ddp_comm_hook:0, line 137 <- wrt source file 2025-12-04T10:03:49.9824968Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/__init__.py::register_ddp_comm_hook:0 2025-12-04T10:03:49.9826070Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py::noop_hook:0, line 23 <- wrt source file 2025-12-04T10:03:49.9827147Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/debugging_hooks.py::noop_hook:0 2025-12-04T10:03:49.9828337Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::allreduce_hook:0, line 51 <- wrt source file 2025-12-04T10:03:49.9829432Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::allreduce_hook:0 2025-12-04T10:03:49.9830512Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_hook:0, line 110 <- wrt source file 2025-12-04T10:03:49.9831634Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_hook:0 2025-12-04T10:03:49.9832725Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_hook:0, line 131 <- wrt source file 2025-12-04T10:03:49.9833827Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_hook:0 2025-12-04T10:03:49.9834926Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_wrapper:0, line 149 <- wrt source file 2025-12-04T10:03:49.9836056Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::fp16_compress_wrapper:0 2025-12-04T10:03:49.9837174Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_wrapper:0, line 188 <- wrt source file 2025-12-04T10:03:49.9838346Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/default_hooks.py::bf16_compress_wrapper:0 2025-12-04T10:03:49.9839464Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/post_localSGD_hook.py::post_localSGD_hook:0, line 91 <- wrt source file 2025-12-04T10:03:49.9840607Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/post_localSGD_hook.py::post_localSGD_hook:0 2025-12-04T10:03:49.9841701Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::powerSGD_hook:0, line 395 <- wrt source file 2025-12-04T10:03:49.9842857Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::powerSGD_hook:0 2025-12-04T10:03:49.9844236Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::batched_powerSGD_hook:0, line 708 <- wrt source file 2025-12-04T10:03:49.9845531Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/powerSGD_hook.py::batched_powerSGD_hook:0 2025-12-04T10:03:49.9846929Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_pertensor_hook:0, line 64 <- wrt source file 2025-12-04T10:03:49.9848585Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_pertensor_hook:0 2025-12-04T10:03:49.9850090Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_perchannel_hook:0, line 146 <- wrt source file 2025-12-04T10:03:49.9851505Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/ddp_comm_hooks/quantization_hooks.py::quantization_perchannel_hook:0 2025-12-04T10:03:49.9852960Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/averagers.py::PeriodicModelAverager:0, line 56 <- wrt source file 2025-12-04T10:03:49.9854347Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/averagers.py::PeriodicModelAverager:0 2025-12-04T10:03:49.9855835Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py::HierarchicalModelAverager:0, line 53 <- wrt source file 2025-12-04T10:03:49.9857292Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/algorithms/model_averaging/hierarchical_model_averager.py::HierarchicalModelAverager:0 2025-12-04T10:03:49.9858485Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::BroadcastingTorchSaveReader:0, line 49 <- wrt source file 2025-12-04T10:03:49.9859718Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::BroadcastingTorchSaveReader:0 2025-12-04T10:03:49.9860903Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::DynamicMetaLoadPlanner:0, line 173 <- wrt source file 2025-12-04T10:03:49.9862079Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/format_utils.py::DynamicMetaLoadPlanner:0 2025-12-04T10:03:49.9863367Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/optimizer.py::load_sharded_optimizer_state_dict:0, line 228 <- wrt source file 2025-12-04T10:03:49.9864983Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/optimizer.py::load_sharded_optimizer_state_dict:0 2025-12-04T10:03:49.9866253Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::get_state_dict:0, line 1276 <- wrt source file 2025-12-04T10:03:49.9867482Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::get_state_dict:0 2025-12-04T10:03:49.9868820Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_model_state_dict:0, line 1531 <- wrt source file 2025-12-04T10:03:49.9870081Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_model_state_dict:0 2025-12-04T10:03:49.9871270Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_optimizer_state_dict:0, line 1590 <- wrt source file 2025-12-04T10:03:49.9872454Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict.py::_patch_optimizer_state_dict:0 2025-12-04T10:03:49.9873552Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_loader.py::load:0, line 131 <- wrt source file 2025-12-04T10:03:49.9874601Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_loader.py::load:0 2025-12-04T10:03:49.9875597Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::save:0, line 160 <- wrt source file 2025-12-04T10:03:49.9876554Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::save:0 2025-12-04T10:03:49.9877619Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::async_save:0, line 275 <- wrt source file 2025-12-04T10:03:49.9878602Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/state_dict_saver.py::async_save:0 2025-12-04T10:03:49.9879613Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/barriers.py::BarrierConfig:0, line 50 <- wrt source file 2025-12-04T10:03:49.9880682Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/barriers.py::BarrierConfig:0 2025-12-04T10:03:49.9881878Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_sync_checkpointer:0, line 78 <- wrt source file 2025-12-04T10:03:49.9882987Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_sync_checkpointer:0 2025-12-04T10:03:49.9884071Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_async_checkpointer:0, line 139 <- wrt source file 2025-12-04T10:03:49.9885290Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/builder.py::make_async_checkpointer:0 2025-12-04T10:03:49.9886425Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer:0, line 104 <- wrt source file 2025-12-04T10:03:49.9887550Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer:0 2025-12-04T10:03:49.9888705Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer.save:0, line 142 <- wrt source file 2025-12-04T10:03:49.9889860Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::SyncCheckpointer.save:0 2025-12-04T10:03:49.9890978Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer:0, line 213 <- wrt source file 2025-12-04T10:03:49.9892166Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer:0 2025-12-04T10:03:49.9893286Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer.save:0, line 260 <- wrt source file 2025-12-04T10:03:49.9894429Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/checkpointer.py::AsyncCheckpointer.save:0 2025-12-04T10:03:49.9895522Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/staging.py::DefaultStager.close:0, line 211 <- wrt source file 2025-12-04T10:03:49.9896608Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/_experimental/staging.py::DefaultStager.close:0 2025-12-04T10:03:49.9897748Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/events/__init__.py::construct_and_record_rdzv_event:0, line 110 <- wrt source file 2025-12-04T10:03:49.9898855Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/events/__init__.py::construct_and_record_rdzv_event:0 2025-12-04T10:03:49.9899928Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py::RendezvousHandler.shutdown:0, line 232 <- wrt source file 2025-12-04T10:03:49.9901185Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py::RendezvousHandler.shutdown:0 2025-12-04T10:03:49.9902330Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/utils/distributed.py::get_free_port:0, line 140 <- wrt source file 2025-12-04T10:03:49.9903345Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/elastic/utils/distributed.py::get_free_port:0 2025-12-04T10:03:49.9904266Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::MixedPrecision:0, line 202 <- wrt source file 2025-12-04T10:03:49.9905300Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::MixedPrecision:0 2025-12-04T10:03:49.9906298Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::StateDictType:0, line 262 <- wrt source file 2025-12-04T10:03:49.9907256Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py::StateDictType:0 2025-12-04T10:03:49.9908267Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel:0, line 125 <- wrt source file 2025-12-04T10:03:49.9909412Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel:0 2025-12-04T10:03:49.9910893Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.set_state_dict_type:0, line 651 <- wrt source file 2025-12-04T10:03:49.9912436Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.set_state_dict_type:0 2025-12-04T10:03:49.9913754Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.state_dict_type:0, line 805 <- wrt source file 2025-12-04T10:03:49.9915284Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.state_dict_type:0 2025-12-04T10:03:49.9916850Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.shard_full_optim_state_dict:0, line 1513 <- wrt source file 2025-12-04T10:03:49.9918336Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.shard_full_optim_state_dict:0 2025-12-04T10:03:49.9919950Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.scatter_full_optim_state_dict:0, line 1633 <- wrt source file 2025-12-04T10:03:49.9921526Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.scatter_full_optim_state_dict:0 2025-12-04T10:03:49.9923388Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.rekey_optim_state_dict:0, line 1718 <- wrt source file 2025-12-04T10:03:49.9924942Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.rekey_optim_state_dict:0 2025-12-04T10:03:49.9926531Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict:0, line 1850 <- wrt source file 2025-12-04T10:03:49.9928186Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict:0 2025-12-04T10:03:49.9929742Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict_to_load:0, line 1937 <- wrt source file 2025-12-04T10:03:49.9931400Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py::FullyShardedDataParallel.optim_state_dict_to_load:0 2025-12-04T10:03:49.9932550Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/sharded_grad_scaler.py::ShardedGradScaler:0, line 57 <- wrt source file 2025-12-04T10:03:49.9933573Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/sharded_grad_scaler.py::ShardedGradScaler:0 2025-12-04T10:03:49.9934487Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py::CustomPolicy:0, line 227 <- wrt source file 2025-12-04T10:03:49.9935358Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/wrap.py::CustomPolicy:0 2025-12-04T10:03:49.9936444Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/functional.py::_all_gather_base:0, line 134 <- wrt source file 2025-12-04T10:03:49.9937368Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/functional.py::_all_gather_base:0 
2025-12-04T10:03:49.9938379Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.__init__:0, line 196 <- wrt source file 2025-12-04T10:03:49.9939414Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.__init__:0 2025-12-04T10:03:49.9940447Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.init_from_module_rref:0, line 527 <- wrt source file 2025-12-04T10:03:49.9941843Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::_RemoteModule.init_from_module_rref:0 2025-12-04T10:03:49.9942853Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::RemoteModule:0, line 658 <- wrt source file 2025-12-04T10:03:49.9943805Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/nn/api/remote_module.py::RemoteModule:0 2025-12-04T10:03:49.9944826Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_apply_optimizer_in_backward:0, line 43 <- wrt source file 2025-12-04T10:03:49.9945959Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_apply_optimizer_in_backward:0 2025-12-04T10:03:49.9947295Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_get_in_backward_optimizers:0, line 114 <- wrt source file 2025-12-04T10:03:49.9948582Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/apply_optimizer_in_backward.py::_get_in_backward_optimizers:0 2025-12-04T10:03:49.9949636Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py::_NamedOptimizer:0, 
line 43 <- wrt source file 2025-12-04T10:03:49.9950619Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/named_optimizer.py::_NamedOptimizer:0 2025-12-04T10:03:49.9951586Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/optimizer.py::DistributedOptimizer:0, line 161 <- wrt source file 2025-12-04T10:03:49.9952664Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/optimizer.py::DistributedOptimizer:0 2025-12-04T10:03:49.9953825Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/post_localSGD_optimizer.py::PostLocalSGDOptimizer:0, line 19 <- wrt source file 2025-12-04T10:03:49.9954923Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/post_localSGD_optimizer.py::PostLocalSGDOptimizer:0 2025-12-04T10:03:49.9956110Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/utils.py::register_functional_optim:0, line 37 <- wrt source file 2025-12-04T10:03:49.9957079Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/utils.py::register_functional_optim:0 2025-12-04T10:03:49.9958351Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py::ZeroRedundancyOptimizer:0, line 341 <- wrt source file 2025-12-04T10:03:49.9959526Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/optim/zero_redundancy_optimizer.py::ZeroRedundancyOptimizer:0 2025-12-04T10:03:49.9960503Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py::pipe_split:0, line 345 <- wrt source file 2025-12-04T10:03:49.9961489Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/_IR.py::pipe_split:0 2025-12-04T10:03:49.9962444Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::_CustomReducer:0, line 36 <- wrt source file 2025-12-04T10:03:49.9963674Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::_CustomReducer:0 2025-12-04T10:03:49.9964803Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_tuple:0, line 85 <- wrt source file 2025-12-04T10:03:49.9965896Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_tuple:0 2025-12-04T10:03:49.9966948Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_dict:0, line 104 <- wrt source file 2025-12-04T10:03:49.9968015Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/pipelining/microbatch.py::TensorChunkSpec.from_dict:0 2025-12-04T10:03:49.9969034Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::_wait_all:0, line 174 <- wrt source file 2025-12-04T10:03:49.9970069Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::_wait_all:0 2025-12-04T10:03:49.9971270Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::shutdown:0, line 343 <- wrt source file 2025-12-04T10:03:49.9972142Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::shutdown:0 2025-12-04T10:03:49.9972936Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::remote:0, line 605 <- wrt source file 2025-12-04T10:03:49.9973747Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::remote:0 2025-12-04T10:03:49.9974547Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_sync:0, line 786 <- wrt source file 2025-12-04T10:03:49.9975359Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_sync:0 2025-12-04T10:03:49.9976170Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_async:0, line 878 <- wrt source file 2025-12-04T10:03:49.9977201Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/api.py::rpc_async:0 2025-12-04T10:03:49.9978097Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/functions.py::async_execution:0, line 34 <- wrt source file 2025-12-04T10:03:49.9979043Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/functions.py::async_execution:0 2025-12-04T10:03:49.9980062Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/options.py::TensorPipeRpcBackendOptions.set_device_map:0, line 126 <- wrt source file 2025-12-04T10:03:49.9981281Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/options.py::TensorPipeRpcBackendOptions.set_device_map:0 2025-12-04T10:03:49.9982517Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py::_server_process_global_profile:0, line 62 <- wrt source file 2025-12-04T10:03:49.9983764Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/rpc/server_process_global_profiler.py::_server_process_global_profile:0 2025-12-04T10:03:49.9984779Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_api.py::_shard_tensor:0, line 887 <- wrt source file 2025-12-04T10:03:49.9985870Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_api.py::_shard_tensor:0 2025-12-04T10:03:49.9986806Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::ShardOrderEntry:0, line 32 <- wrt source file 2025-12-04T10:03:49.9987982Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::ShardOrderEntry:0 2025-12-04T10:03:49.9989052Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._convert_shard_order_to_StridedShard:0, line 165 <- wrt source file 2025-12-04T10:03:49.9990227Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._convert_shard_order_to_StridedShard:0 2025-12-04T10:03:49.9991627Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._maybe_convert_StridedShard_to_shard_order:0, line 241 <- wrt source file 2025-12-04T10:03:49.9992870Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec._maybe_convert_StridedShard_to_shard_order:0 2025-12-04T10:03:49.9994099Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec.format_shard_order_str:0, line 461 <- wrt source file 2025-12-04T10:03:49.9995206Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_dtensor_spec.py::DTensorSpec.format_shard_order_str:0 2025-12-04T10:03:49.9996470Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_random.py::OffsetBasedRNGTracker._set_pre_op_offset:0, line 310 <- wrt source file 2025-12-04T10:03:49.9997570Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_random.py::OffsetBasedRNGTracker._set_pre_op_offset:0 2025-12-04T10:03:49.9998575Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_ops/_common_rules.py::pointwise_rule:0, line 234 <- wrt source file 2025-12-04T10:03:49.9999576Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/_ops/_common_rules.py::pointwise_rule:0 2025-12-04T10:03:50.0000574Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_func_map.py::local_map:0, line 103 <- wrt source file 2025-12-04T10:03:50.0001784Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_func_map.py::local_map:0 2025-12-04T10:03:50.0002856Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_register_sharding.py::register_sharding:0, line 46 <- wrt source file 2025-12-04T10:03:50.0003973Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_register_sharding.py::register_sharding:0 2025-12-04T10:03:50.0005164Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_LoadBalancer._generate_indices:0, line 30 <- wrt source file 2025-12-04T10:03:50.0006657Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_LoadBalancer._generate_indices:0 2025-12-04T10:03:50.0008039Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_HeadTailLoadBalancer._generate_indices:0, line 102 <- wrt source file 2025-12-04T10:03:50.0009427Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_HeadTailLoadBalancer._generate_indices:0 2025-12-04T10:03:50.0010801Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PerDocumentHeadTailLoadBalancer._generate_indices:0, line 213 <- wrt source file 2025-12-04T10:03:50.0012285Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PerDocumentHeadTailLoadBalancer._generate_indices:0 2025-12-04T10:03:50.0013853Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer.ptrr_scheduling:0, line 339 <- wrt source file 2025-12-04T10:03:50.0015208Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer.ptrr_scheduling:0 2025-12-04T10:03:50.0016558Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer._generate_indices:0, line 397 <- wrt source file 2025-12-04T10:03:50.0018087Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/experimental/_context_parallel/_load_balancer.py::_PTRRLoadBalancer._generate_indices:0 2025-12-04T10:03:50.0019244Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/api.py::parallelize_module:0, line 55 <- wrt source file 2025-12-04T10:03:50.0020252Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/api.py::parallelize_module:0 2025-12-04T10:03:50.0021243Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/ddp.py::_pre_dp_module_transform:0, line 88 <- wrt source file 2025-12-04T10:03:50.0022275Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/ddp.py::_pre_dp_module_transform:0 2025-12-04T10:03:50.0023465Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/loss.py::loss_parallel:0, line 56 <- wrt source file 2025-12-04T10:03:50.0024494Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/loss.py::loss_parallel:0 2025-12-04T10:03:50.0025462Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::ColwiseParallel:0, line 64 <- wrt source file 2025-12-04T10:03:50.0026456Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::ColwiseParallel:0 2025-12-04T10:03:50.0027513Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::RowwiseParallel:0, line 198 <- wrt source file 2025-12-04T10:03:50.0028525Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::RowwiseParallel:0 2025-12-04T10:03:50.0029498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::SequenceParallel:0, line 350 <- wrt source file 2025-12-04T10:03:50.0030556Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::SequenceParallel:0 2025-12-04T10:03:50.0031561Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInput:0, line 452 <- wrt source file 2025-12-04T10:03:50.0032817Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInput:0 2025-12-04T10:03:50.0033819Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleOutput:0, line 614 <- wrt 
source file 2025-12-04T10:03:50.0034927Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleOutput:0 2025-12-04T10:03:50.0035959Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInputOutput:0, line 740 <- wrt source file 2025-12-04T10:03:50.0037034Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/tensor/parallel/style.py::PrepareModuleInputOutput:0 2025-12-04T10:03:50.0038165Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/bernoulli.py::Bernoulli:0, line 30 <- wrt source file 2025-12-04T10:03:50.0047507Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/bernoulli.py::Bernoulli:0 2025-12-04T10:03:50.0048783Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/beta.py::Beta:0, line 21 <- wrt source file 2025-12-04T10:03:50.0049713Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/beta.py::Beta:0 2025-12-04T10:03:50.0050537Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/binomial.py::Binomial:0, line 31 <- wrt source file 2025-12-04T10:03:50.0051388Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/binomial.py::Binomial:0 2025-12-04T10:03:50.0052254Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/categorical.py::Categorical:0, line 42 <- wrt source file 2025-12-04T10:03:50.0053156Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/categorical.py::Categorical:0 2025-12-04T10:03:50.0053990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/cauchy.py::Cauchy:0, line 23 <- wrt source file 2025-12-04T10:03:50.0054806Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/cauchy.py::Cauchy:0 2025-12-04T10:03:50.0055780Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/chi2.py::Chi2:0, line 18 <- wrt source file 2025-12-04T10:03:50.0056585Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/chi2.py::Chi2:0 2025-12-04T10:03:50.0057627Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::is_dependent:0, line 167 <- wrt source file 2025-12-04T10:03:50.0058557Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::is_dependent:0 2025-12-04T10:03:50.0059498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::_DependentProperty:0, line 188 <- wrt source file 2025-12-04T10:03:50.0060473Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/constraints.py::_DependentProperty:0 2025-12-04T10:03:50.0061563Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/continuous_bernoulli.py::ContinuousBernoulli:0, line 35 <- wrt source file 2025-12-04T10:03:50.0062864Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/continuous_bernoulli.py::ContinuousBernoulli:0 2025-12-04T10:03:50.0063830Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/dirichlet.py::Dirichlet:0, line 44 <- wrt source file 2025-12-04T10:03:50.0064738Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/dirichlet.py::Dirichlet:0 2025-12-04T10:03:50.0065743Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/exponential.py::Exponential:0, line 20 <- wrt source file 2025-12-04T10:03:50.0066648Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/exponential.py::Exponential:0 
2025-12-04T10:03:50.0067689Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/fishersnedecor.py::FisherSnedecor:0, line 21 <- wrt source file 2025-12-04T10:03:50.0068655Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/fishersnedecor.py::FisherSnedecor:0 2025-12-04T10:03:50.0070021Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gamma.py::Gamma:0, line 24 <- wrt source file 2025-12-04T10:03:50.0071290Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gamma.py::Gamma:0 2025-12-04T10:03:50.0072437Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/generalized_pareto.py::GeneralizedPareto:0, line 26 <- wrt source file 2025-12-04T10:03:50.0073475Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/generalized_pareto.py::GeneralizedPareto:0 2025-12-04T10:03:50.0074803Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/geometric.py::Geometric:0, line 36 <- wrt source file 2025-12-04T10:03:50.0076071Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/geometric.py::Geometric:0 2025-12-04T10:03:50.0077475Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gumbel.py::Gumbel:0, line 23 <- wrt source file 2025-12-04T10:03:50.0078366Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/gumbel.py::Gumbel:0 2025-12-04T10:03:50.0079220Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_cauchy.py::HalfCauchy:0, line 24 <- wrt source file 2025-12-04T10:03:50.0080131Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_cauchy.py::HalfCauchy:0 2025-12-04T10:03:50.0081014Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_normal.py::HalfNormal:0, line 24 <- wrt source file 2025-12-04T10:03:50.0082275Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/half_normal.py::HalfNormal:0 2025-12-04T10:03:50.0083180Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/independent.py::Independent:0, line 27 <- wrt source file 2025-12-04T10:03:50.0084104Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/independent.py::Independent:0 2025-12-04T10:03:50.0084999Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/inverse_gamma.py::InverseGamma:0, line 24 <- wrt source file 2025-12-04T10:03:50.0085914Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/inverse_gamma.py::InverseGamma:0 2025-12-04T10:03:50.0086916Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/kumaraswamy.py::Kumaraswamy:0, line 30 <- wrt source file 2025-12-04T10:03:50.0088201Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/kumaraswamy.py::Kumaraswamy:0 2025-12-04T10:03:50.0089064Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/laplace.py::Laplace:0, line 20 <- wrt source file 2025-12-04T10:03:50.0089901Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/laplace.py::Laplace:0 2025-12-04T10:03:50.0090845Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lkj_cholesky.py::LKJCholesky:0, line 43 <- wrt source file 2025-12-04T10:03:50.0091746Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lkj_cholesky.py::LKJCholesky:0 2025-12-04T10:03:50.0092768Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/log_normal.py::LogNormal:0, line 23 <- wrt 
source file 2025-12-04T10:03:50.0094020Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/log_normal.py::LogNormal:0 2025-12-04T10:03:50.0095064Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/logistic_normal.py::LogisticNormal:0, line 28 <- wrt source file 2025-12-04T10:03:50.0096055Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/logistic_normal.py::LogisticNormal:0 2025-12-04T10:03:50.0097189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py::LowRankMultivariateNormal:0, line 63 <- wrt source file 2025-12-04T10:03:50.0098311Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/lowrank_multivariate_normal.py::LowRankMultivariateNormal:0 2025-12-04T10:03:50.0099645Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/mixture_same_family.py::MixtureSameFamily:0, line 24 <- wrt source file 2025-12-04T10:03:50.0100825Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/mixture_same_family.py::MixtureSameFamily:0 2025-12-04T10:03:50.0101774Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multinomial.py::Multinomial:0, line 38 <- wrt source file 2025-12-04T10:03:50.0102702Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multinomial.py::Multinomial:0 2025-12-04T10:03:50.0103647Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py::MultivariateNormal:0, line 103 <- wrt source file 2025-12-04T10:03:50.0104658Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/multivariate_normal.py::MultivariateNormal:0 2025-12-04T10:03:50.0105840Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/normal.py::Normal:0, line 
22 <- wrt source file 2025-12-04T10:03:50.0106887Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/normal.py::Normal:0 2025-12-04T10:03:50.0107965Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/one_hot_categorical.py::OneHotCategorical:0, line 34 <- wrt source file 2025-12-04T10:03:50.0108972Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/one_hot_categorical.py::OneHotCategorical:0 2025-12-04T10:03:50.0109850Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/pareto.py::Pareto:0, line 20 <- wrt source file 2025-12-04T10:03:50.0110751Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/pareto.py::Pareto:0 2025-12-04T10:03:50.0111742Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/poisson.py::Poisson:0, line 25 <- wrt source file 2025-12-04T10:03:50.0112966Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/poisson.py::Poisson:0 2025-12-04T10:03:50.0114077Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_bernoulli.py::RelaxedBernoulli:0, line 130 <- wrt source file 2025-12-04T10:03:50.0115397Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_bernoulli.py::RelaxedBernoulli:0 2025-12-04T10:03:50.0116410Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_categorical.py::RelaxedOneHotCategorical:0, line 117 <- wrt source file 2025-12-04T10:03:50.0117473Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/relaxed_categorical.py::RelaxedOneHotCategorical:0 2025-12-04T10:03:50.0118385Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/studentT.py::StudentT:0, line 22 <- wrt source file 2025-12-04T10:03:50.0119374Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/studentT.py::StudentT:0 2025-12-04T10:03:50.0120302Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CatTransform:0, line 1076 <- wrt source file 2025-12-04T10:03:50.0121246Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CatTransform:0 2025-12-04T10:03:50.0122131Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::StackTransform:0, line 1190 <- wrt source file 2025-12-04T10:03:50.0123053Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::StackTransform:0 2025-12-04T10:03:50.0124027Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CumulativeDistributionTransform:0, line 1268 <- wrt source file 2025-12-04T10:03:50.0125090Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/transforms.py::CumulativeDistributionTransform:0 2025-12-04T10:03:50.0126003Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/uniform.py::Uniform:0, line 21 <- wrt source file 2025-12-04T10:03:50.0126845Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/uniform.py::Uniform:0 2025-12-04T10:03:50.0127666Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/utils.py::clamp_probs:0, line 114 <- wrt source file 2025-12-04T10:03:50.0130794Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/utils.py::clamp_probs:0 2025-12-04T10:03:50.0131648Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/von_mises.py::VonMises:0, line 119 <- wrt source file 2025-12-04T10:03:50.0139189Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/von_mises.py::VonMises:0 
2025-12-04T10:03:50.0140257Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/weibull.py::Weibull:0, line 22 <- wrt source file 2025-12-04T10:03:50.0144862Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/weibull.py::Weibull:0 2025-12-04T10:03:50.0145816Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/wishart.py::Wishart:0, line 39 <- wrt source file 2025-12-04T10:03:50.0146911Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributions/wishart.py::Wishart:0 2025-12-04T10:03:50.0148015Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/_unlift.py::_convert_guards_code_to_fn:0, line 158 <- wrt source file 2025-12-04T10:03:50.0149071Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/_unlift.py::_convert_guards_code_to_fn:0 2025-12-04T10:03:50.0150103Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::Dim:0, line 123 <- wrt source file 2025-12-04T10:03:50.0151038Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::Dim:0 2025-12-04T10:03:50.0152002Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:0, line 737 <- wrt source file 2025-12-04T10:03:50.0153044Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:0 2025-12-04T10:03:50.0154037Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:1, line 753 <- wrt source file 2025-12-04T10:03:50.0155088Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::ShapesCollection:1 2025-12-04T10:03:50.0156693Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::AdditionalInputs:0, 
line 837 <- wrt source file 2025-12-04T10:03:50.0158115Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/export/dynamic_shapes.py::AdditionalInputs:0 2025-12-04T10:03:50.0158981Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::_snake_case:0, line 104 <- wrt source file 2025-12-04T10:03:50.0159748Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::_snake_case:0 2025-12-04T10:03:50.0160551Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.eliminate_dead_code:0, line 2043 <- wrt source file 2025-12-04T10:03:50.0161632Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.eliminate_dead_code:0 2025-12-04T10:03:50.0162493Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.on_generate_code:0, line 2137 <- wrt source file 2025-12-04T10:03:50.0163335Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/graph.py::Graph.on_generate_code:0 2025-12-04T10:03:50.0164137Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Interpreter:0, line 75 <- wrt source file 2025-12-04T10:03:50.0165335Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Interpreter:0 2025-12-04T10:03:50.0166355Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Transformer:0, line 519 <- wrt source file 2025-12-04T10:03:50.0167356Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/interpreter.py::Transformer:0 2025-12-04T10:03:50.0168316Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/subgraph_rewriter.py::replace_pattern:0, line 126 <- wrt source file 2025-12-04T10:03:50.0169328Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/subgraph_rewriter.py::replace_pattern:0 2025-12-04T10:03:50.0170354Z 
* DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::TensorType:0, line 12 <- wrt source file 2025-12-04T10:03:50.0171307Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::TensorType:0 2025-12-04T10:03:50.0172294Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_consistent:0, line 65 <- wrt source file 2025-12-04T10:03:50.0173321Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_consistent:0 2025-12-04T10:03:50.0174121Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_more_precise:0, line 93 <- wrt source file 2025-12-04T10:03:50.0175058Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/tensor_type.py::is_more_precise:0 2025-12-04T10:03:50.0175841Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate:0, line 300 <- wrt source file 2025-12-04T10:03:50.0176630Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate:0 2025-12-04T10:03:50.0177389Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate_fn:0, line 344 <- wrt source file 2025-12-04T10:03:50.0178177Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/traceback.py::annotate_fn:0 2025-12-04T10:03:50.0179060Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/rewriter.py::AST_Rewriter.visit_AnnAssign:0, line 97 <- wrt source file 2025-12-04T10:03:50.0180187Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/rewriter.py::AST_Rewriter.visit_AnnAssign:0 2025-12-04T10:03:50.0181127Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/core.py::reify:0, line 58 <- wrt source file 2025-12-04T10:03:50.0182032Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/core.py::reify:0 2025-12-04T10:03:50.0182953Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/match.py::VarDispatcher:0, line 48 <- wrt source file 2025-12-04T10:03:50.0183948Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/match.py::VarDispatcher:0 2025-12-04T10:03:50.0184887Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unifiable:0, line 19 <- wrt source file 2025-12-04T10:03:50.0185820Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unifiable:0 2025-12-04T10:03:50.0186733Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::reify_object:0, line 45 <- wrt source file 2025-12-04T10:03:50.0187809Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::reify_object:0 2025-12-04T10:03:50.0188749Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unify_object:0, line 102 <- wrt source file 2025-12-04T10:03:50.0189701Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/more.py::unify_object:0 2025-12-04T10:03:50.0190649Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge:0, line 37 <- wrt source file 2025-12-04T10:03:50.0203577Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge:0 2025-12-04T10:03:50.0204970Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge_with:0, line 64 <- wrt source file 2025-12-04T10:03:50.0207261Z * 
SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::merge_with:0 2025-12-04T10:03:50.0208551Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valmap:0, line 90 <- wrt source file 2025-12-04T10:03:50.0210759Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valmap:0 2025-12-04T10:03:50.0212098Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keymap:0, line 106 <- wrt source file 2025-12-04T10:03:50.0213965Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keymap:0 2025-12-04T10:03:50.0214973Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemmap:0, line 122 <- wrt source file 2025-12-04T10:03:50.0217312Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemmap:0 2025-12-04T10:03:50.0218341Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valfilter:0, line 138 <- wrt source file 2025-12-04T10:03:50.0221564Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::valfilter:0 2025-12-04T10:03:50.0222850Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keyfilter:0, line 158 <- wrt source file 2025-12-04T10:03:50.0225918Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::keyfilter:0 2025-12-04T10:03:50.0227308Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemfilter:0, line 178 <- wrt source file 2025-12-04T10:03:50.0230880Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::itemfilter:0 2025-12-04T10:03:50.0232125Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc:0, line 204 <- wrt source file 2025-12-04T10:03:50.0233652Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc:0 2025-12-04T10:03:50.0234644Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::dissoc:0, line 221 <- wrt source file 2025-12-04T10:03:50.0238497Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::dissoc:0 2025-12-04T10:03:50.0239499Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc_in:0, line 247 <- wrt source file 2025-12-04T10:03:50.0242332Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::assoc_in:0 2025-12-04T10:03:50.0243378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::update_in:0, line 275 <- wrt source file 2025-12-04T10:03:50.0249287Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::update_in:0 2025-12-04T10:03:50.0250397Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::get_in:0, line 329 <- wrt source file 2025-12-04T10:03:50.0257979Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::get_in:0 2025-12-04T10:03:50.0259001Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::groupby:0, line 376 <- wrt source file 2025-12-04T10:03:50.0262121Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::groupby:0 2025-12-04T10:03:50.0263132Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::first:0, line 417 <- wrt source file 2025-12-04T10:03:50.0264959Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/unification_tools.py::first:0 2025-12-04T10:03:50.0265934Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::transitive_get:0, line 15 <- wrt source file 2025-12-04T10:03:50.0269563Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::transitive_get:0 2025-12-04T10:03:50.0270547Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::_toposort:0, line 42 <- wrt source file 2025-12-04T10:03:50.0271719Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::_toposort:0 2025-12-04T10:03:50.0272885Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::reverse_dict:0, line 70 <- wrt source file 2025-12-04T10:03:50.0273956Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::reverse_dict:0 2025-12-04T10:03:50.0274895Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::freeze:0, line 95 <- wrt source file 
2025-12-04T10:03:50.0277394Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/utils.py::freeze:0 2025-12-04T10:03:50.0278345Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/variable.py::variables:0, line 67 <- wrt source file 2025-12-04T10:03:50.0279505Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/variable.py::variables:0 2025-12-04T10:03:50.0280548Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/core.py::dispatch:0, line 28 <- wrt source file 2025-12-04T10:03:50.0282978Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/core.py::dispatch:0 2025-12-04T10:03:50.0284216Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher:0, line 113 <- wrt source file 2025-12-04T10:03:50.0285523Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher:0 2025-12-04T10:03:50.0286809Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.register:0, line 138 <- wrt source file 2025-12-04T10:03:50.0288154Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.register:0 2025-12-04T10:03:50.0289406Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.add:0, line 191 <- wrt source file 2025-12-04T10:03:50.0290686Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.add:0 
2025-12-04T10:03:50.0291849Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.dispatch:0, line 305 <- wrt source file 2025-12-04T10:03:50.0293178Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::Dispatcher.dispatch:0 2025-12-04T10:03:50.0294443Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::str_signature:0, line 436 <- wrt source file 2025-12-04T10:03:50.0295601Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/dispatcher.py::str_signature:0 2025-12-04T10:03:50.0296713Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::expand_tuples:0, line 18 <- wrt source file 2025-12-04T10:03:50.0297959Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::expand_tuples:0 2025-12-04T10:03:50.0299181Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::_toposort:0, line 41 <- wrt source file 2025-12-04T10:03:50.0300291Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::_toposort:0 2025-12-04T10:03:50.0301378Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::reverse_dict:0, line 68 <- wrt source file 2025-12-04T10:03:50.0302622Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::reverse_dict:0 2025-12-04T10:03:50.0303687Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::groupby:0, line 87 <- wrt source file 2025-12-04T10:03:50.0306798Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::groupby:0 2025-12-04T10:03:50.0307996Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::typename:0, line 117 <- wrt source file 2025-12-04T10:03:50.0310867Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/utils.py::typename:0 2025-12-04T10:03:50.0312083Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::isvariadic:0, line 47 <- wrt source file 2025-12-04T10:03:50.0313654Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::isvariadic:0 2025-12-04T10:03:50.0315031Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::Variadic:0, line 83 <- wrt source file 2025-12-04T10:03:50.0316404Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/unification/multipledispatch/variadic.py::Variadic:0 2025-12-04T10:03:50.0317746Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0, line 129 <- wrt source file 2025-12-04T10:03:50.0366055Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0 2025-12-04T10:03:50.0367974Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/shape_prop.py::ShapeProp:0, line 99 <- wrt source file 2025-12-04T10:03:50.0369583Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/shape_prop.py::ShapeProp:0 2025-12-04T10:03:50.0371506Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/split_module.py::split_module:0, line 94 <- wrt source file 2025-12-04T10:03:50.0373801Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/split_module.py::split_module:0 2025-12-04T10:03:50.0374860Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/utils/matcher_with_name_node_map_utils.py::SubgraphMatcherWithNameNodeMap:0, line 51 <- wrt source file 2025-12-04T10:03:50.0376062Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/passes/utils/matcher_with_name_node_map_utils.py::SubgraphMatcherWithNameNodeMap:0 2025-12-04T10:03:50.0377090Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/_check.py::AttributeTypeIsSupportedChecker:0, line 37 <- wrt source file 2025-12-04T10:03:50.0378204Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/_check.py::AttributeTypeIsSupportedChecker:0 2025-12-04T10:03:50.0379133Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_load_for_lite_interpreter:0, line 22 <- wrt source file 2025-12-04T10:03:50.0380067Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_load_for_lite_interpreter:0 2025-12-04T10:03:50.0381012Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_mobile_model_contained_types:0, line 125 <- wrt source file 2025-12-04T10:03:50.0381982Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_mobile_model_contained_types:0 2025-12-04T10:03:50.0382892Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_model_ops_and_info:0, line 225 <- wrt source file 
2025-12-04T10:03:50.0384879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/jit/mobile/__init__.py::_get_model_ops_and_info:0 2025-12-04T10:03:50.0385739Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/_ops.py::logaddexp:0, line 1538 <- wrt source file 2025-12-04T10:03:50.0390700Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/_ops.py::logaddexp:0 2025-12-04T10:03:50.0391585Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/maskedtensor/core.py::is_masked_tensor:0, line 25 <- wrt source file 2025-12-04T10:03:50.0392710Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/masked/maskedtensor/core.py::is_masked_tensor:0 2025-12-04T10:03:50.0393852Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool2d_with_indices:0, line 470 <- wrt source file 2025-12-04T10:03:50.0426290Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool2d_with_indices:0 2025-12-04T10:03:50.0427803Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool3d_with_indices:0, line 589 <- wrt source file 2025-12-04T10:03:50.1064616Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::fractional_max_pool3d_with_indices:0 2025-12-04T10:03:50.1090420Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::gumbel_softmax:0, line 2198 <- wrt source file 2025-12-04T10:03:50.1100966Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::gumbel_softmax:0 2025-12-04T10:03:50.1102369Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding:0, line 2503 <- wrt source file 2025-12-04T10:03:50.1110654Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding:0 2025-12-04T10:03:50.1111718Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding_bag:0, line 2645 <- wrt source file 2025-12-04T10:03:50.1122658Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::embedding_bag:0 2025-12-04T10:03:50.1123685Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::ctc_loss:0, line 3087 <- wrt source file 2025-12-04T10:03:50.1137347Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::ctc_loss:0 2025-12-04T10:03:50.1138520Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::nll_loss:0, line 3157 <- wrt source file 2025-12-04T10:03:50.1144472Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::nll_loss:0 2025-12-04T10:03:50.1145507Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::cross_entropy:0, line 3476 <- wrt source file 2025-12-04T10:03:50.1154396Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::cross_entropy:0 2025-12-04T10:03:50.1155706Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy:0, line 3542 <- wrt source file 2025-12-04T10:03:50.1161673Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy:0 2025-12-04T10:03:50.1162854Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy_with_logits:0, line 3613 <- wrt source file 2025-12-04T10:03:50.1168689Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::binary_cross_entropy_with_logits:0 2025-12-04T10:03:50.1169799Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::pad:0, line 5387 <- wrt source file 2025-12-04T10:03:50.1180473Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/functional.py::pad:0 2025-12-04T10:03:50.1181453Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_input:0, line 32 <- wrt source file 2025-12-04T10:03:50.1190221Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_input:0 2025-12-04T10:03:50.1191220Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_weight:0, line 79 <- wrt source file 2025-12-04T10:03:50.1195638Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv1d_weight:0 2025-12-04T10:03:50.1196618Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_input:0, line 130 <- wrt source file 2025-12-04T10:03:50.1203746Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_input:0 2025-12-04T10:03:50.1204751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_weight:0, line 177 <- wrt source file 2025-12-04T10:03:50.1209088Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv2d_weight:0 2025-12-04T10:03:50.1210061Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_input:0, line 228 <- wrt source file 2025-12-04T10:03:50.1238791Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_input:0 2025-12-04T10:03:50.1239798Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_weight:0, line 275 <- wrt source file 2025-12-04T10:03:50.1256463Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/grad.py::conv3d_weight:0 2025-12-04T10:03:50.1257478Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::calculate_gain:0, line 172 <- wrt source file 2025-12-04T10:03:50.1260742Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::calculate_gain:0 2025-12-04T10:03:50.1261726Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::uniform_:0, line 231 <- wrt source file 2025-12-04T10:03:50.1264750Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::uniform_:0 2025-12-04T10:03:50.1265952Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::normal_:0, line 258 <- wrt source file 2025-12-04T10:03:50.1268680Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::normal_:0 2025-12-04T10:03:50.1269652Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::trunc_normal_:0, line 293 <- wrt source file 2025-12-04T10:03:50.1273059Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::trunc_normal_:0 2025-12-04T10:03:50.1274043Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::constant_:0, line 307 <- wrt source file 2025-12-04T10:03:50.1276890Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::constant_:0 2025-12-04T10:03:50.1277831Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::ones_:0, line 324 <- wrt source file 2025-12-04T10:03:50.1280386Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::ones_:0 2025-12-04T10:03:50.1281321Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::zeros_:0, line 337 <- wrt source file 2025-12-04T10:03:50.1283980Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::zeros_:0 2025-12-04T10:03:50.1284903Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::eye_:0, line 353 <- wrt source file 2025-12-04T10:03:50.1287863Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::eye_:0 2025-12-04T10:03:50.1288805Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::dirac_:0, line 375 <- wrt source file 2025-12-04T10:03:50.1293318Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::dirac_:0 2025-12-04T10:03:50.1294310Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_uniform_:0, line 461 <- wrt source file 2025-12-04T10:03:50.1297519Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_uniform_:0 2025-12-04T10:03:50.1298651Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_normal_:0, line 493 <- wrt source file 2025-12-04T10:03:50.1301059Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::xavier_normal_:0 2025-12-04T10:03:50.1302066Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_uniform_:0, line 545 <- wrt source file 2025-12-04T10:03:50.1305057Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_uniform_:0 2025-12-04T10:03:50.1306232Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_normal_:0, line 610 <- wrt source file 2025-12-04T10:03:50.1309077Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::kaiming_normal_:0 2025-12-04T10:03:50.1310080Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::orthogonal_:0, line 649 <- wrt source file 2025-12-04T10:03:50.1311070Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::orthogonal_:0 2025-12-04T10:03:50.1311906Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::sparse_:0, line 702 <- wrt source file 2025-12-04T10:03:50.1315266Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/init.py::sparse_:0 2025-12-04T10:03:50.1316189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/__init__.py::sdpa_kernel:0, line 124 <- wrt source file 2025-12-04T10:03:50.1317120Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/__init__.py::sdpa_kernel:0 2025-12-04T10:03:50.1318030Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::register_flash_attention_impl:0, line 47 <- wrt source file 2025-12-04T10:03:50.1319005Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::register_flash_attention_impl:0 2025-12-04T10:03:50.1319959Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::activate_flash_attention_impl:0, line 78 <- wrt source file 2025-12-04T10:03:50.1320946Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/_registry.py::activate_flash_attention_impl:0 2025-12-04T10:03:50.1321826Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/bias.py::CausalBias:0, line 94 <- wrt source file 2025-12-04T10:03:50.1322649Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/bias.py::CausalBias:0 2025-12-04T10:03:50.1323463Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/varlen.py::varlen_attn:0, line 166 <- wrt source file 2025-12-04T10:03:50.1324317Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/attention/varlen.py::varlen_attn:0 2025-12-04T10:03:50.1325140Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Threshold:0, line 72 <- wrt source file 
2025-12-04T10:03:50.1325975Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Threshold:0 2025-12-04T10:03:50.1326778Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU:0, line 120 <- wrt source file 2025-12-04T10:03:50.1330796Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU:0 2025-12-04T10:03:50.1331680Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::RReLU:0, line 185 <- wrt source file 2025-12-04T10:03:50.1335248Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::RReLU:0 2025-12-04T10:03:50.1336064Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardtanh:0, line 247 <- wrt source file 2025-12-04T10:03:50.1339647Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardtanh:0 2025-12-04T10:03:50.1340491Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU6:0, line 318 <- wrt source file 2025-12-04T10:03:50.1343936Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ReLU6:0 2025-12-04T10:03:50.1344990Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Sigmoid:0, line 349 <- wrt source file 2025-12-04T10:03:50.1348505Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Sigmoid:0 2025-12-04T10:03:50.1352580Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardsigmoid:0, line 384 <- wrt source file 2025-12-04T10:03:50.1353474Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardsigmoid:0 2025-12-04T10:03:50.1354374Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanh:0, line 420 <- wrt source file 2025-12-04T10:03:50.1356582Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanh:0 2025-12-04T10:03:50.1357398Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SiLU:0, line 456 <- wrt source file 2025-12-04T10:03:50.1360687Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SiLU:0 2025-12-04T10:03:50.1361488Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Mish:0, line 501 <- wrt source file 2025-12-04T10:03:50.1364703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Mish:0 2025-12-04T10:03:50.1365719Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardswish:0, line 552 <- wrt source file 2025-12-04T10:03:50.1368554Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardswish:0 2025-12-04T10:03:50.1369579Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ELU:0, line 598 <- wrt source file 2025-12-04T10:03:50.1372384Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::ELU:0 2025-12-04T10:03:50.1373178Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::CELU:0, line 646 <- wrt source file 2025-12-04T10:03:50.1376344Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::CELU:0 2025-12-04T10:03:50.1377160Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SELU:0, line 705 <- wrt source file 2025-12-04T10:03:50.1380117Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::SELU:0 2025-12-04T10:03:50.1380906Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GLU:0, line 751 <- wrt source file 2025-12-04T10:03:50.1384274Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GLU:0 2025-12-04T10:03:50.1385075Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GELU:0, line 799 <- wrt source file 2025-12-04T10:03:50.1390898Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::GELU:0 2025-12-04T10:03:50.1392064Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardshrink:0, line 848 <- wrt source file 2025-12-04T10:03:50.1395878Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Hardshrink:0 2025-12-04T10:03:50.1397278Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LeakyReLU:0, line 903 <- wrt source file 2025-12-04T10:03:50.1400126Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LeakyReLU:0 2025-12-04T10:03:50.1401506Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSigmoid:0, line 945 <- wrt source file 2025-12-04T10:03:50.1404202Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSigmoid:0 2025-12-04T10:03:50.1405292Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softplus:0, line 981 <- wrt source file 2025-12-04T10:03:50.1408635Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softplus:0 2025-12-04T10:03:50.1409817Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softshrink:0, line 1030 <- wrt source file 2025-12-04T10:03:50.1413015Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softshrink:0 2025-12-04T10:03:50.1414180Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::MultiheadAttention:0, line 1148 <- wrt source file 2025-12-04T10:03:50.1415400Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::MultiheadAttention:0 2025-12-04T10:03:50.1416559Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::PReLU:0, line 1613 <- wrt source file 2025-12-04T10:03:50.1418112Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::PReLU:0 2025-12-04T10:03:50.1419187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softsign:0, line 1664 <- wrt source file 2025-12-04T10:03:50.1422856Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softsign:0 2025-12-04T10:03:50.1423962Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanhshrink:0, line 1690 <- wrt source file 2025-12-04T10:03:50.1427082Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Tanhshrink:0 2025-12-04T10:03:50.1428300Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmin:0, line 1728 <- wrt source file 2025-12-04T10:03:50.1431815Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmin:0 2025-12-04T10:03:50.1432910Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax:0, line 1792 <- wrt source file 2025-12-04T10:03:50.1436135Z * 
SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax:0 2025-12-04T10:03:50.1437299Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax2d:0, line 1839 <- wrt source file 2025-12-04T10:03:50.1440682Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::Softmax2d:0 2025-12-04T10:03:50.1441826Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSoftmax:0, line 1878 <- wrt source file 2025-12-04T10:03:50.1445066Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/activation.py::LogSoftmax:0 2025-12-04T10:03:50.1446270Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm1d:0, line 341 <- wrt source file 2025-12-04T10:03:50.1453071Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm1d:0 2025-12-04T10:03:50.1454178Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm2d:0, line 453 <- wrt source file 2025-12-04T10:03:50.1615265Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm2d:0 2025-12-04T10:03:50.1616418Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm3d:0, line 565 <- wrt source file 2025-12-04T10:03:50.3311273Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::BatchNorm3d:0 2025-12-04T10:03:50.3496924Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm:0, line 690 <- wrt source file 2025-12-04T10:03:50.3498516Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm:0 2025-12-04T10:03:50.3499734Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm.convert_sync_batchnorm:0, line 857 <- wrt source file 2025-12-04T10:03:50.3501058Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/batchnorm.py::SyncBatchNorm.convert_sync_batchnorm:0 2025-12-04T10:03:50.3502261Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/channelshuffle.py::ChannelShuffle:0, line 21 <- wrt source file 2025-12-04T10:03:50.3523311Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/channelshuffle.py::ChannelShuffle:0 2025-12-04T10:03:50.3524809Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential:0, line 81 <- wrt source file 2025-12-04T10:03:50.3526022Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential:0 2025-12-04T10:03:50.3527189Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.append:0, line 263 <- wrt source file 2025-12-04T10:03:50.3534412Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.append:0 2025-12-04T10:03:50.3535603Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.insert:0, line 286 <- wrt source file 2025-12-04T10:03:50.3543271Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.insert:0 2025-12-04T10:03:50.3544457Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.extend:0, line 317 <- wrt source file 2025-12-04T10:03:50.3553738Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::Sequential.extend:0 2025-12-04T10:03:50.3554932Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleList:0, line 346 <- wrt source file 2025-12-04T10:03:50.3556361Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleList:0 2025-12-04T10:03:50.3557456Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleDict:0, line 529 <- wrt source file 2025-12-04T10:03:50.3558708Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ModuleDict:0 2025-12-04T10:03:50.3559805Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterList:0, line 661 <- wrt source file 2025-12-04T10:03:50.3560954Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterList:0 2025-12-04T10:03:50.3562080Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterDict:0, line 819 <- wrt source file 2025-12-04T10:03:50.3563215Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/container.py::ParameterDict:0 2025-12-04T10:03:50.3564342Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::PairwiseDistance:0, line 38 <- wrt source file 2025-12-04T10:03:50.3565867Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::PairwiseDistance:0 2025-12-04T10:03:50.3567017Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::CosineSimilarity:0, line 81 <- wrt source file 2025-12-04T10:03:50.3573085Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/distance.py::CosineSimilarity:0 2025-12-04T10:03:50.3573973Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout:0, line 60 <- wrt source file 
2025-12-04T10:03:50.3577893Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout:0 2025-12-04T10:03:50.3578744Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout1d:0, line 108 <- wrt source file 2025-12-04T10:03:50.3582895Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout1d:0 2025-12-04T10:03:50.3583762Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout2d:0, line 163 <- wrt source file 2025-12-04T10:03:50.3599738Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout2d:0 2025-12-04T10:03:50.3601104Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout3d:0, line 211 <- wrt source file 2025-12-04T10:03:50.3666594Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::Dropout3d:0 2025-12-04T10:03:50.3668095Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::AlphaDropout:0, line 257 <- wrt source file 2025-12-04T10:03:50.3671979Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::AlphaDropout:0 2025-12-04T10:03:50.3673176Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::FeatureAlphaDropout:0, line 309 <- wrt source file 2025-12-04T10:03:50.3738700Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/dropout.py::FeatureAlphaDropout:0 2025-12-04T10:03:50.3740278Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Flatten:0, line 29 <- wrt source file 2025-12-04T10:03:50.3745632Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Flatten:0 2025-12-04T10:03:50.3746687Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Unflatten:0, line 86 <- wrt source file 2025-12-04T10:03:50.3762120Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/flatten.py::Unflatten:0 2025-12-04T10:03:50.3763289Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Fold:0, line 224 <- wrt source file 2025-12-04T10:03:50.3768254Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Fold:0 2025-12-04T10:03:50.3769259Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Unfold:0, line 395 <- wrt source file 2025-12-04T10:03:50.3784170Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/fold.py::Unfold:0 2025-12-04T10:03:50.3785043Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm1d:0, line 188 <- wrt source file 2025-12-04T10:03:50.3796948Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm1d:0 2025-12-04T10:03:50.3798412Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm2d:0, line 304 <- wrt source file 2025-12-04T10:03:50.3918767Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm2d:0 2025-12-04T10:03:50.3919989Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm3d:0, line 420 <- wrt source file 2025-12-04T10:03:50.5573739Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/instancenorm.py::InstanceNorm3d:0 2025-12-04T10:03:50.5757530Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/lazy.py::LazyModuleMixin:0, line 77 <- wrt source file 2025-12-04T10:03:50.5761909Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/lazy.py::LazyModuleMixin:0 2025-12-04T10:03:50.5763038Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Identity:0, line 34 <- wrt source file 2025-12-04T10:03:50.5769699Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Identity:0 2025-12-04T10:03:50.5770765Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Linear:0, line 83 <- wrt source file 2025-12-04T10:03:50.5779091Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Linear:0 2025-12-04T10:03:50.5780399Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Bilinear:0, line 191 <- wrt source file 2025-12-04T10:03:50.5799038Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/linear.py::Bilinear:0 2025-12-04T10:03:50.5800100Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::L1Loss:0, line 116 <- wrt source file 2025-12-04T10:03:50.5807861Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::L1Loss:0 2025-12-04T10:03:50.5809162Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::NLLLoss:0, line 213 <- wrt source file 2025-12-04T10:03:50.5833059Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::NLLLoss:0 2025-12-04T10:03:50.5834136Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::PoissonNLLLoss:0, line 327 <- wrt source file 2025-12-04T10:03:50.5840089Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::PoissonNLLLoss:0 2025-12-04T10:03:50.5841187Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::GaussianNLLLoss:0, line 416 <- 
wrt source file 2025-12-04T10:03:50.5854802Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::GaussianNLLLoss:0 2025-12-04T10:03:50.5856107Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::KLDivLoss:0, line 531 <- wrt source file 2025-12-04T10:03:50.5863948Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::KLDivLoss:0 2025-12-04T10:03:50.5864975Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MSELoss:0, line 613 <- wrt source file 2025-12-04T10:03:50.5870848Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MSELoss:0 2025-12-04T10:03:50.5871864Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCELoss:0, line 696 <- wrt source file 2025-12-04T10:03:50.5878014Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCELoss:0 2025-12-04T10:03:50.5879101Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:0, line 762 <- wrt source file 2025-12-04T10:03:50.5889794Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:0 2025-12-04T10:03:50.5890907Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:1, line 810 <- wrt source file 2025-12-04T10:03:50.5896213Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::BCEWithLogitsLoss:1 2025-12-04T10:03:50.5897343Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiLabelMarginLoss:0, line 958 <- wrt source file 2025-12-04T10:03:50.5904934Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiLabelMarginLoss:0 2025-12-04T10:03:50.5906070Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:0, line 1284 <- wrt source file 2025-12-04T10:03:50.5914820Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:0 2025-12-04T10:03:50.5915931Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:1, line 1311 <- wrt source file 2025-12-04T10:03:50.5917624Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CrossEntropyLoss:1 2025-12-04T10:03:50.5918745Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CosineEmbeddingLoss:0, line 1464 <- wrt source file 2025-12-04T10:03:50.5927246Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CosineEmbeddingLoss:0 2025-12-04T10:03:50.5928370Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MarginRankingLoss:0, line 1531 <- wrt source file 2025-12-04T10:03:50.5934703Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MarginRankingLoss:0 2025-12-04T10:03:50.5935829Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiMarginLoss:0, line 1612 <- wrt source file 2025-12-04T10:03:50.5943113Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::MultiMarginLoss:0 2025-12-04T10:03:50.5944213Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginLoss:0, line 1714 <- wrt source file 2025-12-04T10:03:50.5954282Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginLoss:0 2025-12-04T10:03:50.5955680Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginWithDistanceLoss:0, line 1827 <- wrt source file 
2025-12-04T10:03:50.5972269Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::TripletMarginWithDistanceLoss:0 2025-12-04T10:03:50.5973401Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CTCLoss:0, line 1959 <- wrt source file 2025-12-04T10:03:50.5995575Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/loss.py::CTCLoss:0 2025-12-04T10:03:50.5996741Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.register_buffer:0, line 554 <- wrt source file 2025-12-04T10:03:50.5998272Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.register_buffer:0 2025-12-04T10:03:50.5999419Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.apply:0, line 1048 <- wrt source file 2025-12-04T10:03:50.6011243Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.apply:0 2025-12-04T10:03:50.6012330Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.to:0, line 1299 <- wrt source file 2025-12-04T10:03:50.6019801Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.to:0 2025-12-04T10:03:50.6020915Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.state_dict:0, line 2232 <- wrt source file 2025-12-04T10:03:50.6022114Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.state_dict:0 2025-12-04T10:03:50.6023249Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.parameters:0, line 2678 <- wrt source file 2025-12-04T10:03:50.6024409Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.parameters:0 
2025-12-04T10:03:50.6025561Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_parameters:0, line 2706 <- wrt source file 2025-12-04T10:03:50.6026774Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_parameters:0 2025-12-04T10:03:50.6028014Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.buffers:0, line 2733 <- wrt source file 2025-12-04T10:03:50.6029135Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.buffers:0 2025-12-04T10:03:50.6030255Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_buffers:0, line 2760 <- wrt source file 2025-12-04T10:03:50.6031545Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_buffers:0 2025-12-04T10:03:50.6032702Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_children:0, line 2791 <- wrt source file 2025-12-04T10:03:50.6033875Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_children:0 2025-12-04T10:03:50.6034993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.modules:0, line 2815 <- wrt source file 2025-12-04T10:03:50.6036213Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.modules:0 2025-12-04T10:03:50.6037328Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_modules:0, line 2853 <- wrt source file 2025-12-04T10:03:50.6038500Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py::Module.named_modules:0 2025-12-04T10:03:50.6039667Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LocalResponseNorm:0, line 38 <- wrt source file 2025-12-04T10:03:50.6064523Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LocalResponseNorm:0 2025-12-04T10:03:50.6065858Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LayerNorm:0, line 163 <- wrt source file 2025-12-04T10:03:50.6074287Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::LayerNorm:0 2025-12-04T10:03:50.6075439Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::GroupNorm:0, line 274 <- wrt source file 2025-12-04T10:03:50.6081966Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::GroupNorm:0 2025-12-04T10:03:50.6083109Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::RMSNorm:0, line 369 <- wrt source file 2025-12-04T10:03:50.6087856Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/normalization.py::RMSNorm:0 2025-12-04T10:03:50.6088968Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad1d:0, line 70 <- wrt source file 2025-12-04T10:03:50.6094823Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad1d:0 2025-12-04T10:03:50.6095922Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad2d:0, line 123 <- wrt source file 2025-12-04T10:03:50.6118043Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad2d:0 2025-12-04T10:03:50.6119151Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad3d:0, line 189 <- wrt source file 
2025-12-04T10:03:51.0862271Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::CircularPad3d:0 2025-12-04T10:03:51.1232464Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad1d:0, line 244 <- wrt source file 2025-12-04T10:03:51.1245751Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad1d:0 2025-12-04T10:03:51.1246859Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad2d:0, line 298 <- wrt source file 2025-12-04T10:03:51.1254300Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad2d:0 2025-12-04T10:03:51.1255637Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad3d:0, line 355 <- wrt source file 2025-12-04T10:03:51.1274678Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ConstantPad3d:0 2025-12-04T10:03:51.1276135Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad1d:0, line 401 <- wrt source file 2025-12-04T10:03:51.1283016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad1d:0 2025-12-04T10:03:51.1284176Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad2d:0, line 446 <- wrt source file 2025-12-04T10:03:51.1290080Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad2d:0 2025-12-04T10:03:51.1291205Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad3d:0, line 505 <- wrt source file 2025-12-04T10:03:51.1294779Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReflectionPad3d:0 
2025-12-04T10:03:51.1295917Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad1d:0, line 565 <- wrt source file 2025-12-04T10:03:51.1301412Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad1d:0 2025-12-04T10:03:51.1302579Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad2d:0, line 610 <- wrt source file 2025-12-04T10:03:51.1308762Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad2d:0 2025-12-04T10:03:51.1309934Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad3d:0, line 669 <- wrt source file 2025-12-04T10:03:51.4867929Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ReplicationPad3d:0 2025-12-04T10:03:51.5237661Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad1d:0, line 704 <- wrt source file 2025-12-04T10:03:51.5251152Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad1d:0 2025-12-04T10:03:51.5252225Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad2d:0, line 762 <- wrt source file 2025-12-04T10:03:51.5258851Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad2d:0 2025-12-04T10:03:51.5259676Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad3d:0, line 824 <- wrt source file 2025-12-04T10:03:51.5279581Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/padding.py::ZeroPad3d:0 2025-12-04T10:03:51.5280995Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelShuffle:0, line 40 <- wrt source 
file 2025-12-04T10:03:51.5285689Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelShuffle:0 2025-12-04T10:03:51.5286899Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelUnshuffle:0, line 99 <- wrt source file 2025-12-04T10:03:51.5292016Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pixelshuffle.py::PixelUnshuffle:0 2025-12-04T10:03:51.5293173Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool1d:0, line 129 <- wrt source file 2025-12-04T10:03:51.5297527Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool1d:0 2025-12-04T10:03:51.5298591Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool2d:0, line 207 <- wrt source file 2025-12-04T10:03:51.5328673Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool2d:0 2025-12-04T10:03:51.5329751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool3d:0, line 291 <- wrt source file 2025-12-04T10:03:51.6744537Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxPool3d:0 2025-12-04T10:03:51.6808994Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool1d:0, line 366 <- wrt source file 2025-12-04T10:03:51.6823586Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool1d:0 2025-12-04T10:03:51.6824695Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool2d:0, line 452 <- wrt source file 2025-12-04T10:03:51.6846578Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool2d:0 2025-12-04T10:03:51.6847679Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool3d:0, line 550 <- wrt source file 2025-12-04T10:03:51.7281890Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::MaxUnpool3d:0 2025-12-04T10:03:51.7283027Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool1d:0, line 642 <- wrt source file 2025-12-04T10:03:51.7293756Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool1d:0 2025-12-04T10:03:51.7294847Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool2d:0, line 738 <- wrt source file 2025-12-04T10:03:51.7378126Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool2d:0 2025-12-04T10:03:51.7379484Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool3d:0, line 855 <- wrt source file 2025-12-04T10:03:51.8529308Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AvgPool3d:0 2025-12-04T10:03:51.8595865Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool2d:0, line 946 <- wrt source file 2025-12-04T10:03:51.8630610Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool2d:0 2025-12-04T10:03:51.8632284Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool3d:0, line 1033 <- wrt source file 2025-12-04T10:03:51.9064885Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::FractionalMaxPool3d:0 2025-12-04T10:03:51.9066490Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool1d:0, line 1156 <- wrt source file 2025-12-04T10:03:51.9074696Z * 
SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool1d:0 2025-12-04T10:03:51.9076236Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool2d:0, line 1212 <- wrt source file 2025-12-04T10:03:51.9108656Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool2d:0 2025-12-04T10:03:51.9110208Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool3d:0, line 1276 <- wrt source file 2025-12-04T10:03:52.0593557Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::LPPool3d:0 2025-12-04T10:03:52.0659335Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool1d:0, line 1332 <- wrt source file 2025-12-04T10:03:52.0668240Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool1d:0 2025-12-04T10:03:52.0669927Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool2d:0, line 1367 <- wrt source file 2025-12-04T10:03:52.0678990Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool2d:0 2025-12-04T10:03:52.0680670Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool3d:0, line 1411 <- wrt source file 2025-12-04T10:03:52.0696022Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveMaxPool3d:0 2025-12-04T10:03:52.0697721Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool1d:0, line 1459 <- wrt source file 2025-12-04T10:03:52.0702130Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool1d:0 2025-12-04T10:03:52.0703353Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool2d:0, line 1493 <- wrt source file 2025-12-04T10:03:52.0710469Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool2d:0 2025-12-04T10:03:52.0711658Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool3d:0, line 1533 <- wrt source file 2025-12-04T10:03:52.0723754Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/pooling.py::AdaptiveAvgPool3d:0 2025-12-04T10:03:52.0724821Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNN:0, line 598 <- wrt source file 2025-12-04T10:03:52.0735942Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNN:0 2025-12-04T10:03:52.0736932Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTM:0, line 963 <- wrt source file 2025-12-04T10:03:52.0960621Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTM:0 2025-12-04T10:03:52.0961864Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRU:0, line 1305 <- wrt source file 2025-12-04T10:03:52.0976479Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRU:0 2025-12-04T10:03:52.0977546Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNNCell:0, line 1561 <- wrt source file 2025-12-04T10:03:52.0989354Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::RNNCell:0 2025-12-04T10:03:52.0990600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTMCell:0, line 1683 <- wrt source file 2025-12-04T10:03:52.0999272Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::LSTMCell:0 2025-12-04T10:03:52.1000294Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRUCell:0, line 1797 <- wrt source file 2025-12-04T10:03:52.1011727Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/rnn.py::GRUCell:0 2025-12-04T10:03:52.1012940Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding:0, line 71 <- wrt source file 2025-12-04T10:03:52.1026663Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding:0 2025-12-04T10:03:52.1027924Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding.from_pretrained:0, line 243 <- wrt source file 2025-12-04T10:03:52.1033300Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::Embedding.from_pretrained:0 2025-12-04T10:03:52.1034435Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag:0, line 324 <- wrt source file 2025-12-04T10:03:52.1048093Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag:0 2025-12-04T10:03:52.1049550Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag.from_pretrained:0, line 523 <- wrt source file 2025-12-04T10:03:52.1055794Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/sparse.py::EmbeddingBag.from_pretrained:0 2025-12-04T10:03:52.1057010Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer:0, line 91 <- wrt source file 2025-12-04T10:03:52.7333769Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer:0 2025-12-04T10:03:52.7346158Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer.forward:0, line 267 <- wrt source file 2025-12-04T10:03:52.7347533Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::Transformer.forward:0 2025-12-04T10:03:52.7348788Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoder:0, line 345 <- wrt source file 2025-12-04T10:03:52.8203672Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoder:0 2025-12-04T10:03:52.8335627Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoder:0, line 578 <- wrt source file 2025-12-04T10:03:53.0208527Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoder:0 2025-12-04T10:03:53.0214316Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoderLayer:0, line 702 <- wrt source file 2025-12-04T10:03:53.0447680Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerEncoderLayer:0 2025-12-04T10:03:53.0448997Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoderLayer:0, line 1014 <- wrt source file 2025-12-04T10:03:53.0882978Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py::TransformerDecoderLayer:0 2025-12-04T10:03:53.0884587Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::Upsample:0, line 77 <- wrt source file 2025-12-04T10:03:53.0909884Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::Upsample:0 2025-12-04T10:03:53.0911061Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingNearest2d:0, line 229 <- wrt source file 2025-12-04T10:03:53.0923140Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingNearest2d:0 2025-12-04T10:03:53.0924392Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingBilinear2d:0, line 279 <- wrt source file 2025-12-04T10:03:53.0932301Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/upsampling.py::UpsamplingBilinear2d:0 2025-12-04T10:03:53.0933545Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py::DataParallel:0, line 128 <- wrt source file 2025-12-04T10:03:53.0934494Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/data_parallel.py::DataParallel:0 2025-12-04T10:03:53.0935439Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel:0, line 644 <- wrt source file 2025-12-04T10:03:53.0936655Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel:0 2025-12-04T10:03:53.0937669Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.no_sync:0, line 1451 <- wrt source file 2025-12-04T10:03:53.0938727Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.no_sync:0 2025-12-04T10:03:53.0939734Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.join:0, line 1838 <- wrt source file 2025-12-04T10:03:53.0940758Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.join:0 2025-12-04T10:03:53.0941825Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:0, line 2004 <- wrt source file 2025-12-04T10:03:53.0942937Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:0 2025-12-04T10:03:53.0944024Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:1, line 2014 <- wrt source file 2025-12-04T10:03:53.0945113Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel.register_comm_hook:1 2025-12-04T10:03:53.0946226Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_builtin_comm_hook:0, line 2049 <- wrt source file 2025-12-04T10:03:53.0947457Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_builtin_comm_hook:0 2025-12-04T10:03:53.0948582Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_fused_optim:0, line 2107 <- wrt source file 2025-12-04T10:03:53.0949800Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/parallel/distributed.py::DistributedDataParallel._register_fused_optim:0 2025-12-04T10:03:53.0950803Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_per_sample_grad.py::call_for_per_sample_grads:0, line 35 <- wrt source file 2025-12-04T10:03:53.0951759Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_per_sample_grad.py::call_for_per_sample_grads:0 2025-12-04T10:03:53.0952593Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/init.py::skip_init:0, line 33 <- wrt source file 
2025-12-04T10:03:53.0953542Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/init.py::skip_init:0 2025-12-04T10:03:53.0954437Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv2d_weight_memory_format:0, line 64 <- wrt source file 2025-12-04T10:03:53.0955616Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv2d_weight_memory_format:0 2025-12-04T10:03:53.0956622Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv3d_weight_memory_format:0, line 143 <- wrt source file 2025-12-04T10:03:53.0957636Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/memory_format.py::convert_conv3d_weight_memory_format:0 2025-12-04T10:03:53.0958716Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::orthogonal:0, line 267 <- wrt source file 2025-12-04T10:03:53.0959619Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::orthogonal:0 2025-12-04T10:03:53.0960491Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::weight_norm:0, line 362 <- wrt source file 2025-12-04T10:03:53.0967023Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::weight_norm:0 2025-12-04T10:03:53.0967940Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::spectral_norm:0, line 593 <- wrt source file 2025-12-04T10:03:53.0968879Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrizations.py::spectral_norm:0 2025-12-04T10:03:53.0969711Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::identity:0, line 852 <- wrt source file 2025-12-04T10:03:53.0970500Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::identity:0 2025-12-04T10:03:53.0971304Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_unstructured:0, line 888 <- wrt source file 2025-12-04T10:03:53.0972164Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_unstructured:0 2025-12-04T10:03:53.0972993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::l1_unstructured:0, line 931 <- wrt source file 2025-12-04T10:03:53.0973816Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::l1_unstructured:0 2025-12-04T10:03:53.0974628Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_structured:0, line 971 <- wrt source file 2025-12-04T10:03:53.0975468Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::random_structured:0 2025-12-04T10:03:53.0976359Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::ln_structured:0, line 1017 <- wrt source file 2025-12-04T10:03:53.0985726Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::ln_structured:0 2025-12-04T10:03:53.0986780Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::global_unstructured:0, line 1072 <- wrt source file 2025-12-04T10:03:53.1002584Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::global_unstructured:0 2025-12-04T10:03:53.1003764Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::custom_from_mask:0, line 1175 <- wrt source file 2025-12-04T10:03:53.1012340Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::custom_from_mask:0 2025-12-04T10:03:53.1013332Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::remove:0, line 1203 <- wrt source file 2025-12-04T10:03:53.1018931Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::remove:0 2025-12-04T10:03:53.1019913Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::is_pruned:0, line 1231 <- wrt source file 2025-12-04T10:03:53.1027764Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/prune.py::is_pruned:0 2025-12-04T10:03:53.1028577Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_packed_sequence:0, line 359 <- wrt source file 2025-12-04T10:03:53.1044185Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_packed_sequence:0 2025-12-04T10:03:53.1045213Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_sequence:0, line 439 <- wrt source file 2025-12-04T10:03:53.1050377Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pad_sequence:0 2025-12-04T10:03:53.1051374Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpad_sequence:0, line 500 <- wrt source file 2025-12-04T10:03:53.1063911Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpad_sequence:0 2025-12-04T10:03:53.1064963Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pack_sequence:0, line 556 <- wrt source file 2025-12-04T10:03:53.1072213Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::pack_sequence:0 2025-12-04T10:03:53.1073282Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpack_sequence:0, line 584 <- wrt source file 2025-12-04T10:03:53.1090807Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/rnn.py::unpack_sequence:0 2025-12-04T10:03:53.1091930Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::spectral_norm:0, line 314 <- wrt source file 2025-12-04T10:03:53.1098484Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::spectral_norm:0 2025-12-04T10:03:53.1107052Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::remove_spectral_norm:0, line 347 <- wrt source file 2025-12-04T10:03:53.1108420Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/spectral_norm.py::remove_spectral_norm:0 2025-12-04T10:03:53.1109600Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/stateless.py::functional_call:0, line 193 <- wrt source file 2025-12-04T10:03:53.1110868Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/stateless.py::functional_call:0 2025-12-04T10:03:53.1111974Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::weight_norm:0, line 134 <- wrt source file 2025-12-04T10:03:53.1118576Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::weight_norm:0 2025-12-04T10:03:53.1119719Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::remove_weight_norm:0, line 156 <- wrt source file 2025-12-04T10:03:53.1124712Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py::remove_weight_norm:0 2025-12-04T10:03:53.1125904Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/conv_utils.py::unfold3d:0, line 315 <- wrt source file 2025-12-04T10:03:53.1127154Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/conv_utils.py::unfold3d:0 2025-12-04T10:03:53.1128520Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/expanded_weights_utils.py::sum_over_all_but_batch_and_last_n:0, line 178 <- wrt source file 2025-12-04T10:03:53.1131958Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/_expanded_weights/expanded_weights_utils.py::sum_over_all_but_batch_and_last_n:0 2025-12-04T10:03:53.1133421Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LambdaLR:0, line 357 <- wrt source file 2025-12-04T10:03:53.1134505Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LambdaLR:0 2025-12-04T10:03:53.1135595Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiplicativeLR:0, line 483 <- wrt source file 2025-12-04T10:03:53.1136741Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiplicativeLR:0 2025-12-04T10:03:53.1137803Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::StepLR:0, line 608 <- wrt source file 2025-12-04T10:03:53.1138838Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::StepLR:0 2025-12-04T10:03:53.1139887Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiStepLR:0, line 695 <- wrt source file 2025-12-04T10:03:53.1140982Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::MultiStepLR:0 2025-12-04T10:03:53.1142033Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ConstantLR:0, line 791 <- wrt source file 2025-12-04T10:03:53.1143105Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ConstantLR:0 2025-12-04T10:03:53.1144146Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LinearLR:0, line 898 <- wrt source file 2025-12-04T10:03:53.1145198Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::LinearLR:0 2025-12-04T10:03:53.1146268Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ExponentialLR:0, line 1020 <- wrt source file 2025-12-04T10:03:53.1147475Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ExponentialLR:0 2025-12-04T10:03:53.1148617Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::SequentialLR:0, line 1097 <- wrt source file 2025-12-04T10:03:53.1149726Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::SequentialLR:0 2025-12-04T10:03:53.1150801Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::PolynomialLR:0, line 1249 <- wrt source file 2025-12-04T10:03:53.1151899Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::PolynomialLR:0 2025-12-04T10:03:53.1153013Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingLR:0, line 1378 <- wrt source file 2025-12-04T10:03:53.1154234Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingLR:0 2025-12-04T10:03:53.1155550Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ChainedScheduler:0, line 1490 <- wrt source file 2025-12-04T10:03:53.1156713Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::ChainedScheduler:0 2025-12-04T10:03:53.1157785Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CyclicLR:0, line 1863 <- wrt source file 2025-12-04T10:03:53.1158860Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CyclicLR:0 2025-12-04T10:03:53.1160095Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts:0, line 2129 <- wrt source file 2025-12-04T10:03:53.1161446Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts:0 2025-12-04T10:03:53.1162715Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:0, line 2211 <- wrt source file 2025-12-04T10:03:53.1164027Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:0 2025-12-04T10:03:53.1165318Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:1, line 2227 <- wrt source file 2025-12-04T10:03:53.1166638Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::CosineAnnealingWarmRestarts.step:1 2025-12-04T10:03:53.1167808Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::OneCycleLR:0, line 2367 <- wrt source file 2025-12-04T10:03:53.1168893Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py::OneCycleLR:0 2025-12-04T10:03:53.1170002Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/optimizer.py::Optimizer.load_state_dict:0, line 900 <- wrt source file 2025-12-04T10:03:53.1171198Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/optimizer.py::Optimizer.load_state_dict:0 2025-12-04T10:03:53.1172309Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:0, line 155 <- wrt source file 2025-12-04T10:03:53.1173384Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:0 2025-12-04T10:03:53.1174437Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:1, line 181 <- wrt source file 2025-12-04T10:03:53.1175507Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::AveragedModel:1 2025-12-04T10:03:53.1176607Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::update_bn:0, line 350 <- wrt source file 2025-12-04T10:03:53.1177646Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::update_bn:0 2025-12-04T10:03:53.1178629Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::SWALR:0, line 409 <- wrt source file 2025-12-04T10:03:53.1179632Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/swa_utils.py::SWALR:0 2025-12-04T10:03:53.1180575Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/package/glob_group.py::GlobGroup:0, line 22 <- wrt source file 2025-12-04T10:03:53.1181407Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/package/glob_group.py::GlobGroup:0 2025-12-04T10:03:53.1182330Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::_KinetoProfile.toggle_collection_dynamic:0, line 317 <- wrt source file 2025-12-04T10:03:53.1183374Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::_KinetoProfile.toggle_collection_dynamic:0 2025-12-04T10:03:53.1184263Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::profile:0, line 659 <- wrt source file 2025-12-04T10:03:53.1185063Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/profiler/profiler.py::profile:0 2025-12-04T10:03:53.1186022Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/semi_structured.py::to_sparse_semi_structured:0, line 342 <- wrt source file 2025-12-04T10:03:53.1186977Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/sparse/semi_structured.py::to_sparse_semi_structured:0 2025-12-04T10:03:53.1187983Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py::assert_close:0, line 1477 <- wrt source file 2025-12-04T10:03:53.1207977Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_comparison.py::assert_close:0 2025-12-04T10:03:53.1209009Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_creation.py::make_tensor:0, line 114 <- wrt source file 2025-12-04T10:03:53.1210038Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_creation.py::make_tensor:0 2025-12-04T10:03:53.1211132Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::parametrize:0, line 648 <- wrt source file 2025-12-04T10:03:53.1212297Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::parametrize:0 2025-12-04T10:03:53.1213542Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::reparametrize:0, line 769 <- wrt source file 2025-12-04T10:03:53.1214698Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::reparametrize:0 2025-12-04T10:03:53.1215792Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::decorateIf:0, line 858 <- wrt source file 2025-12-04T10:03:53.1217216Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::decorateIf:0 2025-12-04T10:03:53.1218376Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_symmetric_psd_matrix:0, line 4839 <- wrt source file 2025-12-04T10:03:53.1219564Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_symmetric_psd_matrix:0 2025-12-04T10:03:53.1220640Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_psd_matrix:0, line 4853 <- wrt source file 2025-12-04T10:03:53.1221660Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_psd_matrix:0 2025-12-04T10:03:53.1222650Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_pd_matrix:0, line 4883 <- wrt source file 2025-12-04T10:03:53.1223735Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py::random_hermitian_pd_matrix:0 2025-12-04T10:03:53.1224674Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::logs_to_string:0, line 194 <- wrt source file 2025-12-04T10:03:53.1225606Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::logs_to_string:0 2025-12-04T10:03:53.1226537Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::multiple_logs_to_string:0, line 220 <- wrt source file 2025-12-04T10:03:53.1227605Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/logging_utils.py::multiple_logs_to_string:0 2025-12-04T10:03:53.1228720Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py::skip_unless_torch_gpu:0, line 341 <- wrt source file 2025-12-04T10:03:53.1229891Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/distributed/_tensor/common_dtensor.py::skip_unless_torch_gpu:0 2025-12-04T10:03:53.1231016Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/optests/autograd_registration.py::autograd_registration_check:0, line 29 <- wrt source file 2025-12-04T10:03:53.1239347Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/optests/autograd_registration.py::autograd_registration_check:0 2025-12-04T10:03:53.1240383Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::register_pytree_node:0, line 159 <- wrt source file 2025-12-04T10:03:53.1241289Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::register_pytree_node:0 2025-12-04T10:03:53.1242123Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_is_leaf:0, line 316 <- wrt source file 2025-12-04T10:03:53.1246951Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_is_leaf:0 2025-12-04T10:03:53.1247787Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_flatten:0, line 359 <- wrt source file 2025-12-04T10:03:53.1254748Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_flatten:0 2025-12-04T10:03:53.1256029Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_unflatten:0, line 396 <- wrt source file 2025-12-04T10:03:53.1259658Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_unflatten:0 2025-12-04T10:03:53.1260695Z * 
DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_iter:0, line 429 <- wrt source file 2025-12-04T10:03:53.1266040Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_iter:0 2025-12-04T10:03:53.1266958Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_leaves:0, line 464 <- wrt source file 2025-12-04T10:03:53.1271299Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_leaves:0 2025-12-04T10:03:53.1272117Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_structure:0, line 499 <- wrt source file 2025-12-04T10:03:53.1276493Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_structure:0 2025-12-04T10:03:53.1277933Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_map:0, line 536 <- wrt source file 2025-12-04T10:03:53.1282577Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::tree_map:0 2025-12-04T10:03:53.1283594Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::broadcast_prefix:0, line 929 <- wrt source file 2025-12-04T10:03:53.1291364Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_cxx_pytree.py::broadcast_prefix:0 2025-12-04T10:03:53.1292423Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_dataclass:0, line 308 <- wrt source file 2025-12-04T10:03:53.1301956Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_dataclass:0 2025-12-04T10:03:53.1303198Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_constant:0, line 428 <- wrt source file 2025-12-04T10:03:53.1311230Z * SUCCESS: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::register_constant:0 2025-12-04T10:03:53.1312257Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_is_leaf:0, line 1058 <- wrt source file 2025-12-04T10:03:53.1317366Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_is_leaf:0 2025-12-04T10:03:53.1318346Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_map:0, line 1497 <- wrt source file 2025-12-04T10:03:53.1323855Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_pytree.py::tree_map:0 2025-12-04T10:03:53.1325176Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::rename_privateuse1_backend:0, line 71 <- wrt source file 2025-12-04T10:03:53.1326566Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::rename_privateuse1_backend:0 2025-12-04T10:03:53.1328136Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::generate_methods_for_privateuse1_backend:0, line 382 <- wrt source file 2025-12-04T10:03:53.1329808Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::generate_methods_for_privateuse1_backend:0 2025-12-04T10:03:53.1331048Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::_get_custom_mod_func:0, line 417 <- wrt source file 2025-12-04T10:03:53.1332232Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/backend_registration.py::_get_custom_mod_func:0 2025-12-04T10:03:53.1333298Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::checkpoint_sequential:0, line 561 <- wrt source file 2025-12-04T10:03:53.1334266Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::checkpoint_sequential:0 2025-12-04T10:03:53.1335171Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::set_checkpoint_early_stop:0, line 763 <- wrt source file 2025-12-04T10:03:53.1336089Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::set_checkpoint_early_stop:0 2025-12-04T10:03:53.1336993Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::SelectiveCheckpointContext:0, line 1257 <- wrt source file 2025-12-04T10:03:53.1337989Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::SelectiveCheckpointContext:0 2025-12-04T10:03:53.1338931Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::create_selective_checkpoint_contexts:0, line 1421 <- wrt source file 2025-12-04T10:03:53.1342138Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/checkpoint.py::create_selective_checkpoint_contexts:0 2025-12-04T10:03:53.1343519Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CppExtension:0, line 1247 <- wrt source file 2025-12-04T10:03:53.1344906Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CppExtension:0 2025-12-04T10:03:53.1346213Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:0, line 1319 <- wrt source file 2025-12-04T10:03:53.1347577Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:0 2025-12-04T10:03:53.1348625Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:1, line 1397 <- wrt source file 2025-12-04T10:03:53.1349677Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::CUDAExtension:1 2025-12-04T10:03:53.1350712Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::SyclExtension:0, line 1509 <- wrt source file 2025-12-04T10:03:53.1351781Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::SyclExtension:0 2025-12-04T10:03:53.1352771Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load:0, line 1759 <- wrt source file 2025-12-04T10:03:53.1353607Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load:0 2025-12-04T10:03:53.1354397Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load_inline:0, line 2032 <- wrt source file 2025-12-04T10:03:53.1355415Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/cpp_extension.py::load_inline:0 2025-12-04T10:03:53.1356216Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/dlpack.py::from_dlpack:0, line 93 <- wrt source file 2025-12-04T10:03:53.1363925Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/dlpack.py::from_dlpack:0 2025-12-04T10:03:53.1365119Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/throughput_benchmark.py::ThroughputBenchmark:0, line 78 <- wrt source file 2025-12-04T10:03:53.1366343Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/throughput_benchmark.py::ThroughputBenchmark:0 2025-12-04T10:03:53.1367537Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_sympy/functions.py::MinMaxBase._collapse_arguments:0, line 742 <- wrt source file 2025-12-04T10:03:53.1778077Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_sympy/functions.py::MinMaxBase._collapse_arguments:0 
2025-12-04T10:03:53.1779280Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::IterableDataset:0, line 94 <- wrt source file 2025-12-04T10:03:53.1784952Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::IterableDataset:0 2025-12-04T10:03:53.1786061Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::StackDataset:0, line 218 <- wrt source file 2025-12-04T10:03:53.1787060Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::StackDataset:0 2025-12-04T10:03:53.1788005Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::random_split:0, line 438 <- wrt source file 2025-12-04T10:03:53.1788865Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/dataset.py::random_split:0 2025-12-04T10:03:53.1789735Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/distributed.py::DistributedSampler:0, line 55 <- wrt source file 2025-12-04T10:03:53.1790662Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/distributed.py::DistributedSampler:0 2025-12-04T10:03:53.1791498Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::Sampler:0, line 36 <- wrt source file 2025-12-04T10:03:53.1792469Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::Sampler:0 2025-12-04T10:03:53.1793325Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::WeightedRandomSampler:0, line 225 <- wrt source file 2025-12-04T10:03:53.1794282Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::WeightedRandomSampler:0 2025-12-04T10:03:53.1795141Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::BatchSampler:0, line 
296 <- wrt source file 2025-12-04T10:03:53.1800465Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/sampler.py::BatchSampler:0 2025-12-04T10:03:53.1801343Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_convert:0, line 39 <- wrt source file 2025-12-04T10:03:53.1803968Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_convert:0 2025-12-04T10:03:53.1805041Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::collate:0, line 137 <- wrt source file 2025-12-04T10:03:53.1808964Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::collate:0 2025-12-04T10:03:53.1810035Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_collate:0, line 367 <- wrt source file 2025-12-04T10:03:53.1814394Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py::default_collate:0 2025-12-04T10:03:53.1815294Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::IterDataPipe:0, line 97 <- wrt source file 2025-12-04T10:03:53.1817513Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::IterDataPipe:0 2025-12-04T10:03:53.1818423Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::MapDataPipe:0, line 269 <- wrt source file 2025-12-04T10:03:53.1819435Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/datapipe.py::MapDataPipe:0 2025-12-04T10:03:53.1820401Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::MapperIterDataPipe:0, line 53 <- wrt source file 2025-12-04T10:03:53.1821418Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::MapperIterDataPipe:0 2025-12-04T10:03:53.1822492Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::CollatorIterDataPipe:0, line 202 <- wrt source file 2025-12-04T10:03:53.1823544Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/callable.py::CollatorIterDataPipe:0 2025-12-04T10:03:53.1824595Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combinatorics.py::ShufflerIterDataPipe:0, line 90 <- wrt source file 2025-12-04T10:03:53.1825682Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combinatorics.py::ShufflerIterDataPipe:0 2025-12-04T10:03:53.1826710Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ConcaterIterDataPipe:0, line 38 <- wrt source file 2025-12-04T10:03:53.1853319Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ConcaterIterDataPipe:0 2025-12-04T10:03:53.1854665Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ForkerIterDataPipe:0, line 89 <- wrt source file 2025-12-04T10:03:53.1856138Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ForkerIterDataPipe:0 2025-12-04T10:03:53.1857368Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::_ChildDataPipe:0, line 308 <- wrt source file 2025-12-04T10:03:53.1858615Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::_ChildDataPipe:0 2025-12-04T10:03:53.1859892Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::DemultiplexerIterDataPipe:0, line 397 <- wrt source file 2025-12-04T10:03:53.1861286Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::DemultiplexerIterDataPipe:0 2025-12-04T10:03:53.1862635Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::MultiplexerIterDataPipe:0, line 615 <- wrt source file 2025-12-04T10:03:53.1863827Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::MultiplexerIterDataPipe:0 2025-12-04T10:03:53.1864849Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ZipperIterDataPipe:0, line 685 <- wrt source file 2025-12-04T10:03:53.1865888Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/combining.py::ZipperIterDataPipe:0 2025-12-04T10:03:53.1866936Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/filelister.py::FileListerIterDataPipe:0, line 29 <- wrt source file 2025-12-04T10:03:53.1868079Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/filelister.py::FileListerIterDataPipe:0 2025-12-04T10:03:53.1869199Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/fileopener.py::FileOpenerIterDataPipe:0, line 33 <- wrt source file 2025-12-04T10:03:53.1870275Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/fileopener.py::FileOpenerIterDataPipe:0 2025-12-04T10:03:53.1871293Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::BatcherIterDataPipe:0, line 41 <- wrt source file 2025-12-04T10:03:53.1872395Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::BatcherIterDataPipe:0 2025-12-04T10:03:53.1873417Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::UnBatcherIterDataPipe:0, line 102 <- wrt source file 2025-12-04T10:03:53.1874456Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::UnBatcherIterDataPipe:0 2025-12-04T10:03:53.1875464Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::GrouperIterDataPipe:0, line 169 <- wrt source file 2025-12-04T10:03:53.1876505Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/grouping.py::GrouperIterDataPipe:0 2025-12-04T10:03:53.1877517Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/selecting.py::FilterIterDataPipe:0, line 37 <- wrt source file 2025-12-04T10:03:53.1878662Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/selecting.py::FilterIterDataPipe:0 2025-12-04T10:03:53.1879708Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/streamreader.py::StreamReaderIterDataPipe:0, line 24 <- wrt source file 2025-12-04T10:03:53.1880808Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/streamreader.py::StreamReaderIterDataPipe:0 2025-12-04T10:03:53.1881882Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/utils.py::IterableWrapperIterDataPipe:0, line 29 <- wrt source file 2025-12-04T10:03:53.1882956Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/iter/utils.py::IterableWrapperIterDataPipe:0 2025-12-04T10:03:53.1883973Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/callable.py::MapperMapDataPipe:0, line 36 <- wrt source file 2025-12-04T10:03:53.1884972Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/callable.py::MapperMapDataPipe:0 2025-12-04T10:03:53.1885986Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combinatorics.py::ShufflerIterDataPipe:0, line 34 <- wrt source file 2025-12-04T10:03:53.1887062Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combinatorics.py::ShufflerIterDataPipe:0 2025-12-04T10:03:53.1888085Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ConcaterMapDataPipe:0, line 29 <- wrt source file 2025-12-04T10:03:53.1889107Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ConcaterMapDataPipe:0 2025-12-04T10:03:53.1890100Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ZipperMapDataPipe:0, line 76 <- wrt source file 2025-12-04T10:03:53.1891163Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/combining.py::ZipperMapDataPipe:0 2025-12-04T10:03:53.1892154Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/grouping.py::BatcherMapDataPipe:0, line 29 <- wrt source file 2025-12-04T10:03:53.1893175Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/grouping.py::BatcherMapDataPipe:0 2025-12-04T10:03:53.1894178Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/utils.py::SequenceWrapperMapDataPipe:0, line 29 <- wrt source file 2025-12-04T10:03:53.1895273Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/map/utils.py::SequenceWrapperMapDataPipe:0 2025-12-04T10:03:53.1896274Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py::validate_input_col:0, line 37 <- wrt source file 2025-12-04T10:03:53.1897265Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/common.py::validate_input_col:0 2025-12-04T10:03:53.1898216Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/decoder.py::basichandlers:0, line 47 <- wrt source file 2025-12-04T10:03:53.1899192Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/data/datapipes/utils/decoder.py::basichandlers:0 2025-12-04T10:03:53.1900206Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::find_closure_group:0, line 439 <- wrt source file 2025-12-04T10:03:53.6879051Z * SUCCESS: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::find_closure_group:0 2025-12-04T10:03:53.6880336Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::replace_extern_shared:0, line 535 <- wrt source file 2025-12-04T10:03:53.6881559Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py::replace_extern_shared:0 2025-12-04T10:03:53.6882740Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.__init__:0, line 217 <- wrt source file 2025-12-04T10:03:53.6883818Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.__init__:0 2025-12-04T10:03:53.6884800Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_hparams:0, line 322 <- wrt source file 
2025-12-04T10:03:53.6885791Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_hparams:0 2025-12-04T10:03:53.6886751Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalar:0, line 370 <- wrt source file 2025-12-04T10:03:53.6887761Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalar:0 2025-12-04T10:03:53.6888985Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalars:0, line 402 <- wrt source file 2025-12-04T10:03:53.6890054Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_scalars:0 2025-12-04T10:03:53.6891148Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_tensor:0, line 450 <- wrt source file 2025-12-04T10:03:53.6892691Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_tensor:0 2025-12-04T10:03:53.6893976Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram:0, line 489 <- wrt source file 2025-12-04T10:03:53.6895092Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram:0 2025-12-04T10:03:53.6896230Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram_raw:0, line 542 <- wrt source file 2025-12-04T10:03:53.6897397Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_histogram_raw:0 2025-12-04T10:03:53.6898382Z * DOCTEST : 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_image:0, line 608 <- wrt source file 2025-12-04T10:03:53.6899486Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_image:0 2025-12-04T10:03:53.6900573Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_images:0, line 657 <- wrt source file 2025-12-04T10:03:53.6901562Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_images:0 2025-12-04T10:03:53.6902761Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_text:0, line 820 <- wrt source file 2025-12-04T10:03:53.6904339Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_text:0 2025-12-04T10:03:53.6906152Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_embedding:0, line 887 <- wrt source file 2025-12-04T10:03:53.6908064Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_embedding:0 2025-12-04T10:03:53.6909707Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_pr_curve:0, line 998 <- wrt source file 2025-12-04T10:03:53.6911371Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_pr_curve:0 2025-12-04T10:03:53.6912912Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_multilinechart:0, line 1072 <- wrt source file 2025-12-04T10:03:53.6914350Z * SKIPPED: 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_multilinechart:0 2025-12-04T10:03:53.6915944Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_marginchart:0, line 1093 <- wrt source file 2025-12-04T10:03:53.6917890Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars_marginchart:0 2025-12-04T10:03:53.6919714Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars:0, line 1118 <- wrt source file 2025-12-04T10:03:53.6921439Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_custom_scalars:0 2025-12-04T10:03:53.6922591Z * DOCTEST : /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_mesh:0, line 1164 <- wrt source file 2025-12-04T10:03:53.6924108Z * SKIPPED: /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py::SummaryWriter.add_mesh:0 2025-12-04T10:03:53.6925035Z ============ 2025-12-04T10:03:53.6925343Z Finished doctests 2025-12-04T10:03:53.6925597Z 378 / 894 passed 2025-12-04T10:03:53.6925804Z  2025-12-04T10:03:53.6926023Z === Found 17 parse-time warnings === 2025-12-04T10:03:53.6926500Z --- Parse Warning: 1 / 17 --- 2025-12-04T10:03:53.6927671Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=Library.fallback in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=368. 2025-12-04T10:03:53.6928905Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.6929457Z Registers the function implementation as the fallback for the given key. 
2025-12-04T10:03:53.6929958Z 2025-12-04T10:03:53.6930335Z This function only works for a library with global namespace ("_"). 2025-12-04T10:03:53.6930797Z 2025-12-04T10:03:53.6930936Z Args: 2025-12-04T10:03:53.6931223Z fn: function used as fallback for the given dispatch key or :func:`~fallthrough_kernel` 2025-12-04T10:03:53.6931581Z to register a fallthrough. 2025-12-04T10:03:53.6931957Z dispatch_key: dispatch key that the input function should be registered for. By default, it uses 2025-12-04T10:03:53.6932485Z the dispatch key that the library was created with. 2025-12-04T10:03:53.6932934Z with_keyset: flag controlling if the current dispatcher call keyset should be passed as the first argument 2025-12-04T10:03:53.6933481Z to :attr:`fn` when calling. This should be used to create the appropriate keyset for redispatch calls. 2025-12-04T10:03:53.6933828Z 2025-12-04T10:03:53.6933976Z Example:: 2025-12-04T10:03:53.6934138Z 2025-12-04T10:03:53.6934292Z >>> my_lib = Library("_", "IMPL") 2025-12-04T10:03:53.6934532Z >>> def fallback_kernel(op, *args, **kwargs): 2025-12-04T10:03:53.6934789Z >>> # Handle all autocast ops generically 2025-12-04T10:03:53.6935015Z >>> # ... 
2025-12-04T10:03:53.6935229Z >>> my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T10:03:53.6935461Z 2025-12-04T10:03:53.6935985Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 5, 1, 'my_lib.fallback(fallback_kernel, "Autocast")\n', 5, 7)) 2025-12-04T10:03:53.6936541Z 2025-12-04T10:03:53.6936703Z my_lib.fallback(fallback_kernel, "Autocast") 2025-12-04T10:03:53.6936924Z ^ 2025-12-04T10:03:53.6937072Z warnings.warn(msg) 2025-12-04T10:03:53.6937234Z 2025-12-04T10:03:53.6937460Z --- Parse Warning: 2 / 17 --- 2025-12-04T10:03:53.6938181Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=958. 2025-12-04T10:03:53.6938969Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.6939373Z Register a FakeTensor implementation ("fake impl") for this operator. 2025-12-04T10:03:53.6939657Z 2025-12-04T10:03:53.6939879Z Also sometimes known as a "meta kernel", "abstract impl". 2025-12-04T10:03:53.6940129Z 2025-12-04T10:03:53.6940376Z An "FakeTensor implementation" specifies the behavior of this operator on 2025-12-04T10:03:53.6940775Z Tensors that carry no data ("FakeTensor"). Given some input Tensors with 2025-12-04T10:03:53.6941160Z certain properties (sizes/strides/storage_offset/device), it specifies 2025-12-04T10:03:53.6941560Z what the properties of the output Tensors are. 2025-12-04T10:03:53.6941794Z 2025-12-04T10:03:53.6942034Z The FakeTensor implementation has the same signature as the operator. 2025-12-04T10:03:53.6942408Z It is run for both FakeTensors and meta tensors. 
To write a FakeTensor 2025-12-04T10:03:53.6942777Z implementation, assume that all Tensor inputs to the operator are 2025-12-04T10:03:53.6943161Z regular CPU/CUDA/Meta tensors, but they do not have storage, and 2025-12-04T10:03:53.6943515Z you are trying to return regular CPU/CUDA/Meta tensor(s) as output. 2025-12-04T10:03:53.6943957Z The FakeTensor implementation must consist of only PyTorch operations 2025-12-04T10:03:53.6944327Z (and may not directly access the storage or data of any input or 2025-12-04T10:03:53.6944606Z intermediate Tensors). 2025-12-04T10:03:53.6944785Z 2025-12-04T10:03:53.6944976Z This API may be used as a decorator (see examples). 2025-12-04T10:03:53.6945215Z 2025-12-04T10:03:53.6945384Z For a detailed guide on custom ops, please see 2025-12-04T10:03:53.6945728Z https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html 2025-12-04T10:03:53.6946015Z 2025-12-04T10:03:53.6946138Z Args: 2025-12-04T10:03:53.6946383Z op_name: Operator name (along with the overload) or OpOverload object. 2025-12-04T10:03:53.6946692Z func: Fake tensor implementation. 2025-12-04T10:03:53.6946973Z lib (Optional[Library]): Library to register the fake tensor to. 2025-12-04T10:03:53.6947506Z allow_override: Flag controlling if we want to override an 2025-12-04T10:03:53.6947870Z existing registered fake impl. This is by default off, 2025-12-04T10:03:53.6948189Z and will error you're trying to register a fake impl to 2025-12-04T10:03:53.6948500Z an operator that already has a fake impl. This also only 2025-12-04T10:03:53.6948810Z applies if the custom operator was not created via 2025-12-04T10:03:53.6949131Z torch.library.custom_op, as overriding and existing fake 2025-12-04T10:03:53.6949413Z impl is already allowed. 
2025-12-04T10:03:53.6949632Z 2025-12-04T10:03:53.6949770Z Examples: 2025-12-04T10:03:53.6949934Z >>> import torch 2025-12-04T10:03:53.6950121Z >>> import numpy as np 2025-12-04T10:03:53.6950333Z >>> from torch import Tensor 2025-12-04T10:03:53.6950540Z >>> 2025-12-04T10:03:53.6950757Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T10:03:53.6951108Z >>> @torch.library.custom_op("mylib::custom_linear", mutates_args=()) 2025-12-04T10:03:53.6951481Z >>> def custom_linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T10:03:53.6951838Z >>> raise NotImplementedError("Implementation goes here") 2025-12-04T10:03:53.6952111Z >>> 2025-12-04T10:03:53.6952321Z >>> @torch.library.register_fake("mylib::custom_linear") 2025-12-04T10:03:53.6952587Z >>> def _(x, weight, bias): 2025-12-04T10:03:53.6952800Z >>> assert x.dim() == 2 2025-12-04T10:03:53.6953027Z >>> assert weight.dim() == 2 2025-12-04T10:03:53.6953248Z >>> assert bias.dim() == 1 2025-12-04T10:03:53.6953478Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T10:03:53.6953739Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T10:03:53.6954002Z >>> assert x.device == weight.device 2025-12-04T10:03:53.6954216Z >>> 2025-12-04T10:03:53.6954404Z >>> return (x @ weight.t()) + bias 2025-12-04T10:03:53.6954620Z >>> 2025-12-04T10:03:53.6954826Z >>> with torch._subclasses.fake_tensor.FakeTensorMode(): 2025-12-04T10:03:53.6955095Z >>> x = torch.randn(2, 3) 2025-12-04T10:03:53.6955609Z >>> w = torch.randn(3, 3) 2025-12-04T10:03:53.6955837Z >>> b = torch.randn(3) 2025-12-04T10:03:53.6956065Z >>> y = torch.ops.mylib.custom_linear(x, w, b) 2025-12-04T10:03:53.6956291Z >>> 2025-12-04T10:03:53.6956451Z >>> assert y.shape == (2, 3) 2025-12-04T10:03:53.6956644Z >>> 2025-12-04T10:03:53.6956849Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T10:03:53.6957201Z >>> @torch.library.custom_op("mylib::custom_nonzero", mutates_args=()) 2025-12-04T10:03:53.6957586Z >>> def 
custom_nonzero(x: Tensor) -> Tensor: 2025-12-04T10:03:53.6957831Z >>> x_np = x.numpy(force=True) 2025-12-04T10:03:53.6958070Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T10:03:53.6958330Z >>> return torch.tensor(res, device=x.device) 2025-12-04T10:03:53.6958559Z >>> 2025-12-04T10:03:53.6958773Z >>> @torch.library.register_fake("mylib::custom_nonzero") 2025-12-04T10:03:53.6959031Z >>> def _(x): 2025-12-04T10:03:53.6959241Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T10:03:53.6959524Z >>> # Since we cannot peek at the data in an fake impl, 2025-12-04T10:03:53.6959813Z >>> # we use the ctx object to construct a new symint that 2025-12-04T10:03:53.6960079Z >>> # represents the data-dependent size. 2025-12-04T10:03:53.6960324Z >>> ctx = torch.library.get_ctx() 2025-12-04T10:03:53.6960633Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T10:03:53.6960911Z >>> shape = [nnz, x.dim()] 2025-12-04T10:03:53.6961158Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T10:03:53.6961416Z >>> return result 2025-12-04T10:03:53.6961604Z >>> 2025-12-04T10:03:53.6961816Z >>> from torch.fx.experimental.proxy_tensor import make_fx 2025-12-04T10:03:53.6962071Z >>> 2025-12-04T10:03:53.6962241Z >>> x = torch.tensor([0, 1, 2, 3, 4, 0]) 2025-12-04T10:03:53.6962554Z >>> trace = make_fx(torch.ops.mylib.custom_nonzero, tracing_mode="symbolic")(x) 2025-12-04T10:03:53.6962879Z >>> trace.print_readable() 2025-12-04T10:03:53.6963073Z >>> 2025-12-04T10:03:53.6963310Z >>> assert torch.allclose(trace(x), torch.ops.mylib.custom_nonzero(x)) 2025-12-04T10:03:53.6963588Z 2025-12-04T10:03:53.6963727Z 2025-12-04T10:03:53.6964174Z Original Error: IndentationError('expected an indented block after function definition on line 37', ('', 38, 1, '_._ = None\n', 38, 2)) 2025-12-04T10:03:53.6964652Z 2025-12-04T10:03:53.6964788Z _._ = None 2025-12-04T10:03:53.6964937Z ^ 2025-12-04T10:03:53.6965081Z warnings.warn(msg) 2025-12-04T10:03:53.6965270Z 2025-12-04T10:03:53.6965499Z --- 
Parse Warning: 3 / 17 --- 2025-12-04T10:03:53.6966213Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=get_kernel in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py line=1530. 2025-12-04T10:03:53.6967275Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.6967938Z Returns the computed kernel for a given operator and dispatch key. 2025-12-04T10:03:53.6968278Z 2025-12-04T10:03:53.6968505Z This function retrieves the kernel that would be executed for a given 2025-12-04T10:03:53.6968899Z operator and dispatch key combination. The returned SafeKernelFunction 2025-12-04T10:03:53.6969273Z can be used to call the kernel in a boxed fashion. The intended use 2025-12-04T10:03:53.6969621Z case for this function is to retrieve the original kernel for a given 2025-12-04T10:03:53.6969980Z dispatch key and then register another kernel to the same dispatch key 2025-12-04T10:03:53.6970382Z that calls into the original kernel for certain cases. 2025-12-04T10:03:53.6970630Z 2025-12-04T10:03:53.6970755Z Args: 2025-12-04T10:03:53.6970976Z op: Operator name (along with the overload) or OpOverload object 2025-12-04T10:03:53.6971343Z Can be a string (e.g., "aten::add.Tensor"), an OpOverload, or a CustomOpDef. 2025-12-04T10:03:53.6971744Z dispatch_key (str | torch.DispatchKey): The dispatch key to get the kernel for. 2025-12-04T10:03:53.6972124Z Can be a string (e.g., "CPU", "CUDA") or a DispatchKey enum value. 2025-12-04T10:03:53.6972441Z 2025-12-04T10:03:53.6972572Z Returns: 2025-12-04T10:03:53.6972823Z torch._C._SafeKernelFunction: A safe kernel function that can be used to 2025-12-04T10:03:53.6973126Z call the kernel. 2025-12-04T10:03:53.6973310Z 2025-12-04T10:03:53.6973435Z Raises: 2025-12-04T10:03:53.6973629Z RuntimeError: If the operator does not exist. 
2025-12-04T10:03:53.6973860Z 2025-12-04T10:03:53.6973987Z Example: 2025-12-04T10:03:53.6974165Z >>> # Get the CPU kernel for torch.add 2025-12-04T10:03:53.6974452Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", "CPU") 2025-12-04T10:03:53.6974711Z >>> 2025-12-04T10:03:53.6974888Z >>> # You can also use DispatchKey enum 2025-12-04T10:03:53.6975222Z >>> kernel = torch.library.get_kernel("aten::add.Tensor", torch.DispatchKey.CPU) 2025-12-04T10:03:53.6975539Z >>> 2025-12-04T10:03:53.6975749Z >>> # Or use an OpOverload directly 2025-12-04T10:03:53.6976110Z >>> kernel = torch.library.get_kernel(torch.ops.aten.add.Tensor, "CPU") 2025-12-04T10:03:53.6976415Z >>> 2025-12-04T10:03:53.6976649Z >>> # Example: Using get_kernel in a custom op with conditional dispatch 2025-12-04T10:03:53.6976963Z >>> # Get the original kernel for torch.sin 2025-12-04T10:03:53.6977272Z >>> original_sin_kernel = torch.library.get_kernel("aten::sin", "CPU") 2025-12-04T10:03:53.6977554Z >>> 2025-12-04T10:03:53.6977788Z >>> # If input has negative values, use original sin, otherwise return zeros 2025-12-04T10:03:53.6978117Z >>> def conditional_sin_impl(dispatch_keys, x): 2025-12-04T10:03:53.6978362Z >>> if (x < 0).any(): 2025-12-04T10:03:53.6978615Z >>> return original_sin_kernel.call_boxed(dispatch_keys, x) 2025-12-04T10:03:53.6978887Z >>> else: 2025-12-04T10:03:53.6979083Z >>> return torch.zeros_like(x) 2025-12-04T10:03:53.6979290Z >>> 2025-12-04T10:03:53.6979477Z >>> lib = torch.library.Library("aten", "IMPL") 2025-12-04T10:03:53.6979825Z >>> # with_keyset=True so the first argument to the impl is the current DispatchKeySet 2025-12-04T10:03:53.6980211Z >>> which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T10:03:53.6980540Z >>> lib.impl("sin", conditional_sin_impl, "CPU", with_keyset=True) 2025-12-04T10:03:53.6980808Z >>> 2025-12-04T10:03:53.6980985Z >>> # Test the conditional behavior 2025-12-04T10:03:53.6981217Z >>> x_positive = torch.tensor([1.0, 2.0]) 
2025-12-04T10:03:53.6981463Z >>> x_mixed = torch.tensor([-1.0, 2.0]) 2025-12-04T10:03:53.6981694Z >>> torch.sin(x_positive) 2025-12-04T10:03:53.6981900Z tensor([0., 0.]) 2025-12-04T10:03:53.6982095Z >>> torch.sin(x_mixed) 2025-12-04T10:03:53.6982293Z tensor([-0.8415, 0.9093]) 2025-12-04T10:03:53.6982483Z 2025-12-04T10:03:53.6982888Z Original Error: SyntaxError('invalid syntax', ('', 23, 7, 'which needs to be the first argument to ``kernel.call_boxed``\n', 23, 12)) 2025-12-04T10:03:53.6983338Z 2025-12-04T10:03:53.6983539Z which needs to be the first argument to ``kernel.call_boxed`` 2025-12-04T10:03:53.6983845Z ^ 2025-12-04T10:03:53.6984001Z warnings.warn(msg) 2025-12-04T10:03:53.6984171Z 2025-12-04T10:03:53.6984384Z --- Parse Warning: 4 / 17 --- 2025-12-04T10:03:53.6985135Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=is_available in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=70. 2025-12-04T10:03:53.6985985Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.6986446Z Check if the current accelerator is available at runtime: it was build, all the 2025-12-04T10:03:53.6986833Z required drivers are available and at least one device is visible. 2025-12-04T10:03:53.6987157Z See :ref:`accelerator` for details. 2025-12-04T10:03:53.6987481Z 2025-12-04T10:03:53.6987610Z Returns: 2025-12-04T10:03:53.6987899Z bool: A boolean indicating if there is an available :ref:`accelerator`. 2025-12-04T10:03:53.6988220Z 2025-12-04T10:03:53.6988458Z .. note:: This API delegates to the device-specific version of `is_available`. 2025-12-04T10:03:53.6988873Z On CUDA, when the environment variable ``PYTORCH_NVML_BASED_CUDA_CHECK=1`` is set, 2025-12-04T10:03:53.6989360Z this function will NOT poison fork. Otherwise, it will. For more details, see 2025-12-04T10:03:53.6989957Z :ref:`multiprocessing-poison-fork-note`. 
2025-12-04T10:03:53.6990339Z 2025-12-04T10:03:53.6990610Z Example:: 2025-12-04T10:03:53.6990763Z 2025-12-04T10:03:53.6991065Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T10:03:53.6991387Z 2025-12-04T10:03:53.6991851Z Original Error: SyntaxError('invalid syntax', ('', 1, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 1, 78)) 2025-12-04T10:03:53.6992355Z 2025-12-04T10:03:53.6992737Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T10:03:53.6993285Z ^ 2025-12-04T10:03:53.6993641Z warnings.warn(msg) 2025-12-04T10:03:53.6993848Z 2025-12-04T10:03:53.6994074Z --- Parse Warning: 5 / 17 --- 2025-12-04T10:03:53.6994831Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=synchronize in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/accelerator/__init__.py line=239. 2025-12-04T10:03:53.6995677Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.6996065Z Wait for all kernels in all streams on the given device to complete. 2025-12-04T10:03:53.6996337Z 2025-12-04T10:03:53.6996471Z Args: 2025-12-04T10:03:53.6996769Z device (:class:`torch.device`, str, int, optional): device for which to synchronize. It must match 2025-12-04T10:03:53.6997223Z the current :ref:`accelerator` device type. If not given, 2025-12-04T10:03:53.6997594Z use :func:`torch.accelerator.current_device_index` by default. 2025-12-04T10:03:53.6997860Z 2025-12-04T10:03:53.6998143Z .. note:: This function is a no-op if the current :ref:`accelerator` is not initialized. 2025-12-04T10:03:53.6998481Z 2025-12-04T10:03:53.6998613Z Example:: 2025-12-04T10:03:53.6998756Z 2025-12-04T10:03:53.6998927Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_CUDA) 2025-12-04T10:03:53.6999278Z >>> assert torch.accelerator.is_available() "No available accelerators detected." 
2025-12-04T10:03:53.6999629Z >>> start_event = torch.Event(enable_timing=True) 2025-12-04T10:03:53.6999902Z >>> end_event = torch.Event(enable_timing=True) 2025-12-04T10:03:53.7000144Z >>> start_event.record() 2025-12-04T10:03:53.7000506Z >>> tensor = torch.randn(100, device=torch.accelerator.current_accelerator()) 2025-12-04T10:03:53.7000821Z >>> sum = torch.sum(tensor) 2025-12-04T10:03:53.7001030Z >>> end_event.record() 2025-12-04T10:03:53.7001251Z >>> torch.accelerator.synchronize() 2025-12-04T10:03:53.7001522Z >>> elapsed_time_ms = start_event.elapsed_time(end_event) 2025-12-04T10:03:53.7001768Z 2025-12-04T10:03:53.7002226Z Original Error: SyntaxError('invalid syntax', ('', 2, 41, 'assert torch.accelerator.is_available() "No available accelerators detected."\n', 2, 78)) 2025-12-04T10:03:53.7002785Z 2025-12-04T10:03:53.7003033Z assert torch.accelerator.is_available() "No available accelerators detected." 2025-12-04T10:03:53.7003352Z ^ 2025-12-04T10:03:53.7003560Z warnings.warn(msg) 2025-12-04T10:03:53.7003722Z 2025-12-04T10:03:53.7003927Z --- Parse Warning: 6 / 17 --- 2025-12-04T10:03:53.7004648Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=cudart in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py line=448. 2025-12-04T10:03:53.7005429Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7005755Z Retrieves the CUDA runtime API module. 2025-12-04T10:03:53.7005962Z 2025-12-04T10:03:53.7006093Z 2025-12-04T10:03:53.7006334Z This function initializes the CUDA runtime environment if it is not already 2025-12-04T10:03:53.7006836Z initialized and returns the CUDA runtime API module (_cudart). The CUDA 2025-12-04T10:03:53.7007234Z runtime API module provides access to various CUDA runtime functions. 
2025-12-04T10:03:53.7007516Z 2025-12-04T10:03:53.7007649Z Args: 2025-12-04T10:03:53.7007796Z ``None`` 2025-12-04T10:03:53.7007947Z 2025-12-04T10:03:53.7008084Z Returns: 2025-12-04T10:03:53.7008292Z module: The CUDA runtime API module (_cudart). 2025-12-04T10:03:53.7008520Z 2025-12-04T10:03:53.7008657Z Raises: 2025-12-04T10:03:53.7008909Z RuntimeError: If CUDA cannot be re-initialized in a forked subprocess. 2025-12-04T10:03:53.7009394Z AssertionError: If PyTorch is not compiled with CUDA support or if libcudart functions are unavailable. 2025-12-04T10:03:53.7009764Z 2025-12-04T10:03:53.7009938Z Example of CUDA operations with profiling: 2025-12-04T10:03:53.7010173Z >>> import torch 2025-12-04T10:03:53.7010387Z >>> from torch.cuda import cudart, check_error 2025-12-04T10:03:53.7010623Z >>> import os 2025-12-04T10:03:53.7010795Z >>> 2025-12-04T10:03:53.7010974Z >>> os.environ["CUDA_PROFILE"] = "1" 2025-12-04T10:03:53.7011189Z >>> 2025-12-04T10:03:53.7011372Z >>> def perform_cuda_operations_with_streams(): 2025-12-04T10:03:53.7011616Z >>> stream = torch.cuda.Stream() 2025-12-04T10:03:53.7011851Z >>> with torch.cuda.stream(stream): 2025-12-04T10:03:53.7012108Z >>> x = torch.randn(100, 100, device='cuda') 2025-12-04T10:03:53.7012358Z >>> y = torch.randn(100, 100, device='cuda') 2025-12-04T10:03:53.7012586Z >>> z = torch.mul(x, y) 2025-12-04T10:03:53.7012791Z >>> return z 2025-12-04T10:03:53.7012971Z >>> 2025-12-04T10:03:53.7013131Z >>> torch.cuda.synchronize() 2025-12-04T10:03:53.7013388Z >>> print("====== Start nsys profiling ======") 2025-12-04T10:03:53.7013655Z >>> check_error(cudart().cudaProfilerStart()) 2025-12-04T10:03:53.7013921Z >>> with torch.autograd.profiler.emit_nvtx(): 2025-12-04T10:03:53.7014196Z >>> result = perform_cuda_operations_with_streams() 2025-12-04T10:03:53.7014467Z >>> print("CUDA operations completed.") 2025-12-04T10:03:53.7014784Z >>> check_error(torch.cuda.cudart().cudaProfilerStop()) 2025-12-04T10:03:53.7015054Z >>> print("====== End nsys 
profiling ======") 2025-12-04T10:03:53.7015267Z 2025-12-04T10:03:53.7015481Z To run this example and save the profiling information, execute: 2025-12-04T10:03:53.7015945Z >>> $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T10:03:53.7016322Z 2025-12-04T10:03:53.7016566Z This command profiles the CUDA operations in the provided script and saves 2025-12-04T10:03:53.7016985Z the profiling information to a file named `trace_name.prof`. 2025-12-04T10:03:53.7017352Z The `--profile-from-start off` option ensures that profiling starts only 2025-12-04T10:03:53.7017688Z after the `cudaProfilerStart` call in the script. 2025-12-04T10:03:53.7018011Z The `--csv` and `--print-summary` options format the profiling output as a 2025-12-04T10:03:53.7018326Z CSV file and print a summary, respectively. 2025-12-04T10:03:53.7018658Z The `-o` option specifies the output file name, and the `-f` option forces the 2025-12-04T10:03:53.7019008Z overwrite of the output file if it already exists. 2025-12-04T10:03:53.7019237Z 2025-12-04T10:03:53.7019764Z Original Error: SyntaxError('invalid syntax', ('', 1, 1, '$ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py\n', 1, 2)) 2025-12-04T10:03:53.7020334Z 2025-12-04T10:03:53.7020730Z $ nvprof --profile-from-start off --csv --print-summary -o trace_name.prof -f -- python cudart_test.py 2025-12-04T10:03:53.7021095Z ^ 2025-12-04T10:03:53.7021247Z warnings.warn(msg) 2025-12-04T10:03:53.7021420Z 2025-12-04T10:03:53.7021633Z --- Parse Warning: 7 / 17 --- 2025-12-04T10:03:53.7022354Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=vmap in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=39. 
2025-12-04T10:03:53.7023154Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7023456Z 2025-12-04T10:03:53.7023675Z vmap is the vectorizing map; ``vmap(func)`` returns a new function that 2025-12-04T10:03:53.7024039Z maps ``func`` over some dimension of the inputs. Semantically, vmap 2025-12-04T10:03:53.7024397Z pushes the map into PyTorch operations called by ``func``, effectively 2025-12-04T10:03:53.7024699Z vectorizing those operations. 2025-12-04T10:03:53.7024890Z 2025-12-04T10:03:53.7025114Z vmap is useful for handling batch dimensions: one can write a function 2025-12-04T10:03:53.7025467Z ``func`` that runs on examples and then lift it to a function that can 2025-12-04T10:03:53.7025826Z take batches of examples with ``vmap(func)``. vmap can also be used to 2025-12-04T10:03:53.7026160Z compute batched gradients when composed with autograd. 2025-12-04T10:03:53.7026403Z 2025-12-04T10:03:53.7026531Z .. note:: 2025-12-04T10:03:53.7026751Z :func:`torch.vmap` is aliased to :func:`torch.func.vmap` for 2025-12-04T10:03:53.7027053Z convenience. Use whichever one you'd like. 2025-12-04T10:03:53.7027355Z 2025-12-04T10:03:53.7027490Z Args: 2025-12-04T10:03:53.7027715Z func (function): A Python function that takes one or more arguments. 2025-12-04T10:03:53.7028008Z Must return one or more Tensors. 2025-12-04T10:03:53.7028295Z in_dims (int or nested structure): Specifies which dimension of the 2025-12-04T10:03:53.7028632Z inputs should be mapped over. ``in_dims`` should have a 2025-12-04T10:03:53.7028954Z structure like the inputs. If the ``in_dim`` for a particular 2025-12-04T10:03:53.7029280Z input is None, then that indicates there is no map dimension. 2025-12-04T10:03:53.7029545Z Default: 0. 2025-12-04T10:03:53.7029830Z out_dims (int or Tuple[int]): Specifies where the mapped dimension 2025-12-04T10:03:53.7030159Z should appear in the outputs. 
If ``out_dims`` is a Tuple, then 2025-12-04T10:03:53.7030467Z it should have one element per output. Default: 0. 2025-12-04T10:03:53.7030775Z randomness (str): Specifies whether the randomness in this 2025-12-04T10:03:53.7031118Z vmap should be the same or different across batches. If 'different', 2025-12-04T10:03:53.7031467Z the randomness for each batch will be different. If 'same', the 2025-12-04T10:03:53.7031883Z randomness will be the same across batches. If 'error', any calls to 2025-12-04T10:03:53.7032248Z random functions will error. Default: 'error'. WARNING: this flag 2025-12-04T10:03:53.7032600Z only applies to random PyTorch operations and does not apply to 2025-12-04T10:03:53.7032910Z Python's random module or numpy randomness. 2025-12-04T10:03:53.7033233Z chunk_size (None or int): If None (default), apply a single vmap over inputs. 2025-12-04T10:03:53.7033614Z If not None, then compute the vmap :attr:`chunk_size` samples at a time. 2025-12-04T10:03:53.7034017Z Note that :attr:`chunk_size=1` is equivalent to computing the vmap with a for-loop. 2025-12-04T10:03:53.7034453Z If you run into memory issues computing the vmap, please try a non-None chunk_size. 2025-12-04T10:03:53.7034768Z 2025-12-04T10:03:53.7034898Z Returns: 2025-12-04T10:03:53.7035122Z Returns a new "batched" function. It takes the same inputs as 2025-12-04T10:03:53.7035558Z ``func``, except each input has an extra dimension at the index 2025-12-04T10:03:53.7035903Z specified by ``in_dims``. It takes returns the same outputs as 2025-12-04T10:03:53.7036228Z ``func``, except each output has an extra dimension at the index 2025-12-04T10:03:53.7036505Z specified by ``out_dims``. 2025-12-04T10:03:53.7036700Z 2025-12-04T10:03:53.7036831Z .. warning: 2025-12-04T10:03:53.7037068Z :func:`vmap` works best with functional-style code. Please do not 2025-12-04T10:03:53.7037420Z perform any side-effects in ``func``, with the exception of 2025-12-04T10:03:53.7037781Z in-place PyTorch operations. 
Examples of side-effects include mutating 2025-12-04T10:03:53.7038177Z Python data structures and assigning values to variables not captured 2025-12-04T10:03:53.7038468Z in ``func``. 2025-12-04T10:03:53.7038626Z 2025-12-04T10:03:53.7038863Z One example of using :func:`vmap` is to compute batched dot products. PyTorch 2025-12-04T10:03:53.7039255Z doesn't provide a batched ``torch.dot`` API; instead of unsuccessfully 2025-12-04T10:03:53.7039628Z rummaging through docs, use :func:`vmap` to construct a new function. 2025-12-04T10:03:53.7039905Z 2025-12-04T10:03:53.7040054Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:53.7040335Z >>> batched_dot = torch.func.vmap(torch.dot) # [N, D], [N, D] -> [N] 2025-12-04T10:03:53.7040630Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T10:03:53.7040870Z >>> batched_dot(x, y) 2025-12-04T10:03:53.7041048Z 2025-12-04T10:03:53.7041269Z :func:`vmap` can be helpful in hiding batch dimensions, leading to a simpler 2025-12-04T10:03:53.7041575Z model authoring experience. 2025-12-04T10:03:53.7041758Z 2025-12-04T10:03:53.7041906Z >>> batch_size, feature_size = 3, 5 2025-12-04T10:03:53.7042167Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T10:03:53.7042423Z >>> 2025-12-04T10:03:53.7042579Z >>> def model(feature_vec): 2025-12-04T10:03:53.7042796Z >>> # Very simple linear model with activation 2025-12-04T10:03:53.7043044Z >>> return feature_vec.dot(weights).relu() 2025-12-04T10:03:53.7043257Z >>> 2025-12-04T10:03:53.7043436Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T10:03:53.7043750Z >>> result = torch.vmap(model)(examples) 2025-12-04T10:03:53.7043973Z 2025-12-04T10:03:53.7044216Z :func:`vmap` can also help vectorize computations that were previously difficult 2025-12-04T10:03:53.7044623Z or impossible to batch. One example is higher-order gradient computation. 2025-12-04T10:03:53.7045013Z The PyTorch autograd engine computes vjps (vector-Jacobian products). 
2025-12-04T10:03:53.7045399Z Computing a full Jacobian matrix for some function f: R^N -> R^N usually 2025-12-04T10:03:53.7045830Z requires N calls to ``autograd.grad``, one per Jacobian row. Using :func:`vmap`, 2025-12-04T10:03:53.7046237Z we can vectorize the whole computation, computing the Jacobian in a single 2025-12-04T10:03:53.7046544Z call to ``autograd.grad``. 2025-12-04T10:03:53.7046720Z 2025-12-04T10:03:53.7046858Z >>> # Setup 2025-12-04T10:03:53.7047014Z >>> N = 5 2025-12-04T10:03:53.7047167Z >>> f = lambda x: x**2 2025-12-04T10:03:53.7047377Z >>> x = torch.randn(N, requires_grad=True) 2025-12-04T10:03:53.7047606Z >>> y = f(x) 2025-12-04T10:03:53.7047776Z >>> I_N = torch.eye(N) 2025-12-04T10:03:53.7047946Z >>> 2025-12-04T10:03:53.7048102Z >>> # Sequential approach 2025-12-04T10:03:53.7048384Z >>> jacobian_rows = [torch.autograd.grad(y, x, v, retain_graph=True)[0] 2025-12-04T10:03:53.7048678Z >>> for v in I_N.unbind()] 2025-12-04T10:03:53.7048920Z >>> jacobian = torch.stack(jacobian_rows) 2025-12-04T10:03:53.7049139Z >>> 2025-12-04T10:03:53.7049346Z >>> # vectorized gradient computation 2025-12-04T10:03:53.7049600Z >>> def get_vjp(v): 2025-12-04T10:03:53.7049812Z >>> return torch.autograd.grad(y, x, v) 2025-12-04T10:03:53.7050050Z >>> jacobian = torch.vmap(get_vjp)(I_N) 2025-12-04T10:03:53.7050260Z 2025-12-04T10:03:53.7050521Z :func:`vmap` can also be nested, producing an output with multiple batched dimensions 2025-12-04T10:03:53.7050832Z 2025-12-04T10:03:53.7050976Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:53.7051199Z >>> batched_dot = torch.vmap( 2025-12-04T10:03:53.7051410Z ... torch.vmap(torch.dot) 2025-12-04T10:03:53.7051633Z ... 
) # [N1, N0, D], [N1, N0, D] -> [N1, N0] 2025-12-04T10:03:53.7051885Z >>> x, y = torch.randn(2, 3, 5), torch.randn(2, 3, 5) 2025-12-04T10:03:53.7052143Z >>> batched_dot(x, y) # tensor of size [2, 3] 2025-12-04T10:03:53.7052349Z 2025-12-04T10:03:53.7052590Z If the inputs are not batched along the first dimension, ``in_dims`` specifies 2025-12-04T10:03:53.7052947Z the dimension that each inputs are batched along as 2025-12-04T10:03:53.7053175Z 2025-12-04T10:03:53.7053324Z >>> torch.dot # [N], [N] -> [] 2025-12-04T10:03:53.7053615Z >>> batched_dot = torch.vmap(torch.dot, in_dims=1) # [N, D], [N, D] -> [D] 2025-12-04T10:03:53.7053928Z >>> x, y = torch.randn(2, 5), torch.randn(2, 5) 2025-12-04T10:03:53.7054144Z >>> batched_dot( 2025-12-04T10:03:53.7054311Z ... x, y 2025-12-04T10:03:53.7054541Z ... ) # output is [5] instead of [2] if batched along the 0th dimension 2025-12-04T10:03:53.7054794Z 2025-12-04T10:03:53.7055045Z If there are multiple inputs each of which is batched along different dimensions, 2025-12-04T10:03:53.7055616Z ``in_dims`` must be a tuple with the batch dimension for each input as 2025-12-04T10:03:53.7055876Z 2025-12-04T10:03:53.7056019Z >>> torch.dot # [D], [D] -> [] 2025-12-04T10:03:53.7056325Z >>> batched_dot = torch.vmap(torch.dot, in_dims=(0, None)) # [N, D], [D] -> [N] 2025-12-04T10:03:53.7056645Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T10:03:53.7056859Z >>> batched_dot( 2025-12-04T10:03:53.7057039Z ... x, y 2025-12-04T10:03:53.7057271Z ... 
) # second arg doesn't have a batch dim because in_dim[1] was None 2025-12-04T10:03:53.7057533Z 2025-12-04T10:03:53.7057869Z If the input is a Python struct, ``in_dims`` must be a tuple containing a struct 2025-12-04T10:03:53.7058189Z matching the shape of the input: 2025-12-04T10:03:53.7058379Z 2025-12-04T10:03:53.7058552Z >>> f = lambda dict: torch.dot(dict["x"], dict["y"]) 2025-12-04T10:03:53.7058808Z >>> x, y = torch.randn(2, 5), torch.randn(5) 2025-12-04T10:03:53.7059026Z >>> input = {"x": x, "y": y} 2025-12-04T10:03:53.7059284Z >>> batched_dot = torch.vmap(f, in_dims=({"x": 0, "y": None},)) 2025-12-04T10:03:53.7059616Z >>> batched_dot(input) 2025-12-04T10:03:53.7059797Z 2025-12-04T10:03:53.7060056Z By default, the output is batched along the first dimension. However, it can be batched 2025-12-04T10:03:53.7068841Z along any dimension by using ``out_dims`` 2025-12-04T10:03:53.7069088Z 2025-12-04T10:03:53.7069251Z >>> f = lambda x: x**2 2025-12-04T10:03:53.7069462Z >>> x = torch.randn(2, 5) 2025-12-04T10:03:53.7069700Z >>> batched_pow = torch.vmap(f, out_dims=1) 2025-12-04T10:03:53.7069938Z >>> batched_pow(x) # [5, 2] 2025-12-04T10:03:53.7070136Z 2025-12-04T10:03:53.7070423Z For any function that uses kwargs, the returned function will not batch the kwargs but will 2025-12-04T10:03:53.7070762Z accept kwargs 2025-12-04T10:03:53.7070925Z 2025-12-04T10:03:53.7071075Z >>> x = torch.randn([2, 5]) 2025-12-04T10:03:53.7071281Z >>> def fn(x, scale=4.): 2025-12-04T10:03:53.7071474Z >>> return x * scale 2025-12-04T10:03:53.7071786Z >>> 2025-12-04T10:03:53.7071947Z >>> batched_pow = torch.vmap(fn) 2025-12-04T10:03:53.7072270Z >>> assert torch.allclose(batched_pow(x), x * 4) 2025-12-04T10:03:53.7072604Z >>> batched_pow(x, scale=x) # scale is not batched, output has shape [2, 2, 5] 2025-12-04T10:03:53.7072899Z 2025-12-04T10:03:53.7073033Z .. 
note:: 2025-12-04T10:03:53.7073284Z vmap does not provide general autobatching or handle variable-length 2025-12-04T10:03:53.7073589Z sequences out of the box. 2025-12-04T10:03:53.7073774Z 2025-12-04T10:03:53.7074206Z Original Error: IndentationError('expected an indented block after function definition on line 4', ('', 5, 1, '_._ = None\n', 5, 2)) 2025-12-04T10:03:53.7074690Z 2025-12-04T10:03:53.7074815Z _._ = None 2025-12-04T10:03:53.7074965Z ^ 2025-12-04T10:03:53.7075114Z warnings.warn(msg) 2025-12-04T10:03:53.7075286Z 2025-12-04T10:03:53.7075553Z --- Parse Warning: 8 / 17 --- 2025-12-04T10:03:53.7076294Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=grad in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_functorch/apis.py line=306. 2025-12-04T10:03:53.7077087Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7077496Z ``grad`` operator helps computing gradients of ``func`` with respect to the 2025-12-04T10:03:53.7077874Z input(s) specified by ``argnums``. This operator can be nested to 2025-12-04T10:03:53.7078169Z compute higher-order gradients. 2025-12-04T10:03:53.7078377Z 2025-12-04T10:03:53.7078508Z Args: 2025-12-04T10:03:53.7078746Z func (Callable): A Python function that takes one or more arguments. 2025-12-04T10:03:53.7079149Z Must return a single-element Tensor. If specified ``has_aux`` equals ``True``, 2025-12-04T10:03:53.7079583Z function can return a tuple of single-element Tensor and other auxiliary objects: 2025-12-04T10:03:53.7079920Z ``(output, aux)``. 2025-12-04T10:03:53.7080247Z argnums (int or Tuple[int]): Specifies arguments to compute gradients with respect to. 2025-12-04T10:03:53.7080663Z ``argnums`` can be single integer or tuple of integers. Default: 0. 
2025-12-04T10:03:53.7081087Z has_aux (bool): Flag indicating that ``func`` returns a tensor and other 2025-12-04T10:03:53.7081436Z auxiliary objects: ``(output, aux)``. Default: False. 2025-12-04T10:03:53.7081677Z 2025-12-04T10:03:53.7081810Z Returns: 2025-12-04T10:03:53.7082099Z Function to compute gradients with respect to its inputs. By default, the output of 2025-12-04T10:03:53.7082532Z the function is the gradient tensor(s) with respect to the first argument. 2025-12-04T10:03:53.7082943Z If specified ``has_aux`` equals ``True``, tuple of gradients and output auxiliary objects 2025-12-04T10:03:53.7083436Z is returned. If ``argnums`` is a tuple of integers, a tuple of output gradients with 2025-12-04T10:03:53.7083788Z respect to each ``argnums`` value is returned. 2025-12-04T10:03:53.7084023Z 2025-12-04T10:03:53.7084167Z Example of using ``grad``: 2025-12-04T10:03:53.7084358Z 2025-12-04T10:03:53.7084505Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7084711Z >>> from torch.func import grad 2025-12-04T10:03:53.7084941Z >>> x = torch.randn([]) 2025-12-04T10:03:53.7085173Z >>> cos_x = grad(lambda x: torch.sin(x))(x) 2025-12-04T10:03:53.7085421Z >>> assert torch.allclose(cos_x, x.cos()) 2025-12-04T10:03:53.7085641Z >>> 2025-12-04T10:03:53.7085810Z >>> # Second-order gradients 2025-12-04T10:03:53.7086058Z >>> neg_sin_x = grad(grad(lambda x: torch.sin(x)))(x) 2025-12-04T10:03:53.7086325Z >>> assert torch.allclose(neg_sin_x, -x.sin()) 2025-12-04T10:03:53.7086595Z 2025-12-04T10:03:53.7086882Z When composed with ``vmap``, ``grad`` can be used to compute per-sample-gradients: 2025-12-04T10:03:53.7087185Z 2025-12-04T10:03:53.7087327Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7087541Z >>> from torch.func import grad, vmap 2025-12-04T10:03:53.7087777Z >>> batch_size, feature_size = 3, 5 2025-12-04T10:03:53.7087991Z >>> 2025-12-04T10:03:53.7088167Z >>> def model(weights, feature_vec): 2025-12-04T10:03:53.7088414Z >>> # Very simple linear model with activation 
2025-12-04T10:03:53.7088661Z >>> assert feature_vec.dim() == 1 2025-12-04T10:03:53.7088906Z >>> return feature_vec.dot(weights).relu() 2025-12-04T10:03:53.7089130Z >>> 2025-12-04T10:03:53.7089314Z >>> def compute_loss(weights, example, target): 2025-12-04T10:03:53.7089559Z >>> y = model(weights, example) 2025-12-04T10:03:53.7089813Z >>> return ((y - target) ** 2).mean() # MSELoss 2025-12-04T10:03:53.7090040Z >>> 2025-12-04T10:03:53.7090252Z >>> weights = torch.randn(feature_size, requires_grad=True) 2025-12-04T10:03:53.7090553Z >>> examples = torch.randn(batch_size, feature_size) 2025-12-04T10:03:53.7090806Z >>> targets = torch.randn(batch_size) 2025-12-04T10:03:53.7091050Z >>> inputs = (weights, examples, targets) 2025-12-04T10:03:53.7091377Z >>> grad_weight_per_example = vmap(grad(compute_loss), in_dims=(None, 0, 0))( 2025-12-04T10:03:53.7091680Z ... *inputs 2025-12-04T10:03:53.7091845Z ... ) 2025-12-04T10:03:53.7092009Z 2025-12-04T10:03:53.7092215Z Example of using ``grad`` with ``has_aux`` and ``argnums``: 2025-12-04T10:03:53.7092461Z 2025-12-04T10:03:53.7092606Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7095001Z >>> from torch.func import grad 2025-12-04T10:03:53.7095306Z >>> def my_loss_func(y, y_pred): 2025-12-04T10:03:53.7095562Z >>> loss_per_sample = (0.5 * y_pred - y) ** 2 2025-12-04T10:03:53.7095822Z >>> loss = loss_per_sample.mean() 2025-12-04T10:03:53.7096070Z >>> return loss, (y_pred, loss_per_sample) 2025-12-04T10:03:53.7096291Z >>> 2025-12-04T10:03:53.7096496Z >>> fn = grad(my_loss_func, argnums=(0, 1), has_aux=True) 2025-12-04T10:03:53.7096828Z >>> y_true = torch.rand(4) 2025-12-04T10:03:53.7097079Z >>> y_preds = torch.rand(4, requires_grad=True) 2025-12-04T10:03:53.7097326Z >>> out = fn(y_true, y_preds) 2025-12-04T10:03:53.7101154Z >>> # > output is ((grads w.r.t y_true, grads w.r.t y_preds), (y_pred, loss_per_sample)) 2025-12-04T10:03:53.7101516Z 2025-12-04T10:03:53.7101667Z .. 
note:: 2025-12-04T10:03:53.7101891Z Using PyTorch ``torch.no_grad`` together with ``grad``. 2025-12-04T10:03:53.7102170Z 2025-12-04T10:03:53.7102431Z Case 1: Using ``torch.no_grad`` inside a function: 2025-12-04T10:03:53.7102667Z 2025-12-04T10:03:53.7102818Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7103024Z >>> def f(x): 2025-12-04T10:03:53.7103215Z >>> with torch.no_grad(): 2025-12-04T10:03:53.7103428Z >>> c = x ** 2 2025-12-04T10:03:53.7103630Z >>> return x - c 2025-12-04T10:03:53.7103812Z 2025-12-04T10:03:53.7104024Z In this case, ``grad(f)(x)`` will respect the inner ``torch.no_grad``. 2025-12-04T10:03:53.7104291Z 2025-12-04T10:03:53.7104485Z Case 2: Using ``grad`` inside ``torch.no_grad`` context manager: 2025-12-04T10:03:53.7104751Z 2025-12-04T10:03:53.7104909Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7105110Z >>> with torch.no_grad(): 2025-12-04T10:03:53.7105319Z >>> grad(f)(x) 2025-12-04T10:03:53.7105503Z 2025-12-04T10:03:53.7105723Z In this case, ``grad`` will respect the inner ``torch.no_grad``, but not the 2025-12-04T10:03:53.7106153Z outer one. This is because ``grad`` is a "function transform": its result 2025-12-04T10:03:53.7106530Z should not depend on the result of a context manager outside of ``f``. 2025-12-04T10:03:53.7106862Z 2025-12-04T10:03:53.7106987Z 2025-12-04T10:03:53.7107510Z Original Error: IndentationError('expected an indented block after function definition on line 5', ('', 6, 1, '_._ = None\n', 6, 2)) 2025-12-04T10:03:53.7107999Z 2025-12-04T10:03:53.7108125Z _._ = None 2025-12-04T10:03:53.7108272Z ^ 2025-12-04T10:03:53.7108421Z warnings.warn(msg) 2025-12-04T10:03:53.7108584Z 2025-12-04T10:03:53.7108838Z --- Parse Warning: 9 / 17 --- 2025-12-04T10:03:53.7109659Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=CustomOpDef.register_fake in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/custom_ops.py line=402. 
2025-12-04T10:03:53.7110555Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7110926Z Register a FakeTensor implementation for this custom op. 2025-12-04T10:03:53.7111182Z 2025-12-04T10:03:53.7111431Z This is necessary to get the operator to work efficiently with torch.compile. 2025-12-04T10:03:53.7111732Z 2025-12-04T10:03:53.7111977Z The Fake impl (sometimes also known as a meta kernel or abstract impl) 2025-12-04T10:03:53.7112358Z specifies the behavior of this operator on Tensors that carry no data. 2025-12-04T10:03:53.7112692Z Given some input Tensors with certain properties 2025-12-04T10:03:53.7113050Z (sizes/strides/storage_offset/device), it specifies what the properties of 2025-12-04T10:03:53.7113375Z the output Tensors are. 2025-12-04T10:03:53.7113653Z 2025-12-04T10:03:53.7113879Z Please see :func:`torch.library.register_fake` for more details. 2025-12-04T10:03:53.7114152Z 2025-12-04T10:03:53.7114283Z Args: 2025-12-04T10:03:53.7114497Z fn (Callable): The function to register as the FakeTensor 2025-12-04T10:03:53.7114779Z implementation. 
2025-12-04T10:03:53.7114976Z 2025-12-04T10:03:53.7115108Z Examples: 2025-12-04T10:03:53.7115278Z >>> import torch 2025-12-04T10:03:53.7115477Z >>> import numpy as np 2025-12-04T10:03:53.7115689Z >>> from torch import Tensor 2025-12-04T10:03:53.7115893Z >>> 2025-12-04T10:03:53.7116117Z >>> # Example 1: an operator without data-dependent output shape 2025-12-04T10:03:53.7116519Z >>> @torch.library.custom_op("mylib::linear", mutates_args=()) 2025-12-04T10:03:53.7116862Z >>> def linear(x: Tensor, weight: Tensor, bias: Tensor) -> Tensor: 2025-12-04T10:03:53.7117196Z >>> return (x @ weight.t()) + bias 2025-12-04T10:03:53.7117405Z >>> 2025-12-04T10:03:53.7117568Z >>> @linear.register_fake 2025-12-04T10:03:53.7117802Z >>> def _(x, weight, bias): 2025-12-04T10:03:53.7118020Z >>> assert x.dim() == 2 2025-12-04T10:03:53.7118234Z >>> assert weight.dim() == 2 2025-12-04T10:03:53.7118467Z >>> assert bias.dim() == 1 2025-12-04T10:03:53.7118701Z >>> assert x.shape[1] == weight.shape[1] 2025-12-04T10:03:53.7118958Z >>> assert weight.shape[0] == bias.shape[0] 2025-12-04T10:03:53.7119206Z >>> assert x.device == weight.device 2025-12-04T10:03:53.7119471Z >>> return x.new_empty(x.size(0), weight.size(0)) 2025-12-04T10:03:53.7119710Z >>> 2025-12-04T10:03:53.7119868Z >>> x = torch.randn(2, 2) 2025-12-04T10:03:53.7120092Z >>> weight = torch.randn(2, 2) 2025-12-04T10:03:53.7120375Z >>> bias = torch.randn(2) 2025-12-04T10:03:53.7120604Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:53.7120908Z >>> out = torch.compile(linear, fullgraph=True)(x, weight, bias) 2025-12-04T10:03:53.7121206Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:53.7121544Z >>> assert torch.allclose(out, torch.nn.functional.linear(x, weight, bias)) 2025-12-04T10:03:53.7121846Z >>> 2025-12-04T10:03:53.7122062Z >>> # Example 2: an operator with data-dependent output shape 2025-12-04T10:03:53.7122398Z >>> @torch.library.custom_op("mylib::nonzero", mutates_args=()) 
2025-12-04T10:03:53.7122694Z >>> def nonzero(x: Tensor) -> Tensor: 2025-12-04T10:03:53.7122929Z >>> x_np = x.cpu().numpy() 2025-12-04T10:03:53.7123164Z >>> res = np.stack(np.nonzero(x_np), axis=1) 2025-12-04T10:03:53.7123432Z >>> return torch.tensor(res, device=x.device) 2025-12-04T10:03:53.7123654Z >>> 2025-12-04T10:03:53.7123825Z >>> @nonzero.register_fake 2025-12-04T10:03:53.7124050Z >>> def _(x): 2025-12-04T10:03:53.7124278Z >>> # Number of nonzero-elements is data-dependent. 2025-12-04T10:03:53.7124575Z >>> # Since we cannot peek at the data in an abstract impl, 2025-12-04T10:03:53.7124873Z >>> # we use the ctx object to construct a new symint that 2025-12-04T10:03:53.7125148Z >>> # represents the data-dependent size. 2025-12-04T10:03:53.7125391Z >>> ctx = torch.library.get_ctx() 2025-12-04T10:03:53.7125628Z >>> nnz = ctx.new_dynamic_size() 2025-12-04T10:03:53.7125861Z >>> shape = [nnz, x.dim()] 2025-12-04T10:03:53.7126166Z >>> result = x.new_empty(shape, dtype=torch.int64) 2025-12-04T10:03:53.7126418Z >>> return result 2025-12-04T10:03:53.7126609Z >>> 2025-12-04T10:03:53.7126777Z >>> x = torch.tensor([0, 1, 2, 0, 0, 1]) 2025-12-04T10:03:53.7127042Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:53.7127308Z >>> out = torch.compile(nonzero, fullgraph=True)(x) 2025-12-04T10:03:53.7127571Z >>> # xdoctest: +SKIP("Requires Python <= 3.11") 2025-12-04T10:03:53.7127821Z >>> assert torch.allclose(out, x.nonzero()) 2025-12-04T10:03:53.7128038Z 2025-12-04T10:03:53.7128172Z 2025-12-04T10:03:53.7128670Z Original Error: IndentationError('expected an indented block after function definition on line 36', ('', 37, 1, '_._ = None\n', 37, 2)) 2025-12-04T10:03:53.7129152Z 2025-12-04T10:03:53.7129279Z _._ = None 2025-12-04T10:03:53.7129413Z ^ 2025-12-04T10:03:53.7129558Z warnings.warn(msg) 2025-12-04T10:03:53.7129764Z 2025-12-04T10:03:53.7129973Z --- Parse Warning: 10 / 17 --- 2025-12-04T10:03:53.7130780Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=unsafe_generate_fake_kernels in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_library/fake_profile.py line=94. 2025-12-04T10:03:53.7131686Z Caused by: DoctestParseError('Failed to parse doctest in _label_docsrc_lines') 2025-12-04T10:03:53.7131997Z 2025-12-04T10:03:53.7132219Z Registers a fake kernel based on the given operator profiles. This fake 2025-12-04T10:03:53.7132612Z kernel registration will override any existing fake kernel registrations. 2025-12-04T10:03:53.7132904Z 2025-12-04T10:03:53.7133117Z The input is a dictionary mapping operator names to a set of operator 2025-12-04T10:03:53.7133497Z profiles, which we will use to generate fake kernels. The operator profiles 2025-12-04T10:03:53.7133868Z are a record of the input and output tensor metadata. Based on this 2025-12-04T10:03:53.7134291Z information we will match a given input to the recorded profile, and return 2025-12-04T10:03:53.7134673Z an output with the same metadata as in the recorded profile. If a profile 2025-12-04T10:03:53.7134996Z doesn't exist then an exception will be thrown. 2025-12-04T10:03:53.7135220Z 2025-12-04T10:03:53.7135442Z The fake kernel generation is considered unsafe because it relies on the 2025-12-04T10:03:53.7135826Z rigid, pre-defined operator profiles that do not account for potential 2025-12-04T10:03:53.7136218Z variations in output behavior. Specifically, the generated kernels assume a 2025-12-04T10:03:53.7136629Z fixed relationship between input and output ranks. However, in reality, it's 2025-12-04T10:03:53.7137034Z possible that data-dependent operations may produce outputs of different 2025-12-04T10:03:53.7137434Z ranks even when given inputs of the same rank. The generated fake kernels 2025-12-04T10:03:53.7137809Z are inflexible and unable to accommodate these nuances, making them 2025-12-04T10:03:53.7138090Z potentially unsafe. 
2025-12-04T10:03:53.7138254Z 2025-12-04T10:03:53.7138382Z Args: 2025-12-04T10:03:53.7138606Z op_profiles (dict[str, set[OpProfile]]): A dictionary mapping operator 2025-12-04T10:03:53.7138964Z name to a set of operator profiles from which we will generate fake 2025-12-04T10:03:53.7139233Z kernels. 2025-12-04T10:03:53.7139388Z 2025-12-04T10:03:53.7139519Z Examples: 2025-12-04T10:03:53.7139664Z 2025-12-04T10:03:53.7139857Z >>> # Example: Registering an op-profile from draft-export 2025-12-04T10:03:53.7140107Z >>> import torch 2025-12-04T10:03:53.7140330Z >>> from torch.export._draft_export import draft_export 2025-12-04T10:03:53.7140825Z >>> 2025-12-04T10:03:53.7141084Z >>> @torch.library.custom_op("mylib::foo", mutates_args=()) 2025-12-04T10:03:53.7141443Z >>> def foo(x: Tensor, y: Tensor) -> Tensor: 2025-12-04T10:03:53.7141672Z >>> return x + y 2025-12-04T10:03:53.7141838Z >>> 2025-12-04T10:03:53.7141998Z >>> class M(torch.nn.Module): 2025-12-04T10:03:53.7142212Z >>> def forward(self, a, b): 2025-12-04T10:03:53.7142453Z >>> res = torch.ops.mylib.foo(a, b) # no fake impl 2025-12-04T10:03:53.7142683Z >>> return res 2025-12-04T10:03:53.7142852Z >>> 2025-12-04T10:03:53.7143045Z >>> ep = draft_export(M(), (torch.ones(3, 4), torch.ones(3, 4)) 2025-12-04T10:03:53.7143286Z >>> 2025-12-04T10:03:53.7143565Z >>> with torch._library.fake_profile.unsafe_generate_fake_kernels(ep._report.op_profiles): 2025-12-04T10:03:53.7143985Z >>> decomp = ep.run_decompositions() 2025-12-04T10:03:53.7144193Z 2025-12-04T10:03:53.7144323Z 2025-12-04T10:03:53.7144721Z Original Error: IncompleteParseError('ill-formed doctest: all parts have been processed but the doctest source is not balanced') 2025-12-04T10:03:53.7145207Z 2025-12-04T10:03:53.7145360Z warnings.warn(msg) 2025-12-04T10:03:53.7145527Z 2025-12-04T10:03:53.7145741Z --- Parse Warning: 11 / 17 --- 2025-12-04T10:03:53.7146695Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape 
callname=ActivationSparsifier in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/ao/pruning/_experimental/activation_sparsifier/activation_sparsifier.py line=16. 2025-12-04T10:03:53.7147828Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7148128Z 2025-12-04T10:03:53.7148374Z The Activation sparsifier class aims to sparsify/prune activations in a neural 2025-12-04T10:03:53.7148786Z network. The idea is to attach the sparsifier to a layer (or layers) and it 2025-12-04T10:03:53.7149185Z zeroes out the activations based on the mask_fn (or sparsification function) 2025-12-04T10:03:53.7149544Z input by the user. 2025-12-04T10:03:53.7149792Z The mask_fn is applied once all the inputs are aggregated and reduced i.e. 2025-12-04T10:03:53.7150118Z mask = mask_fn(reduce_fn(aggregate_fn(activations))) 2025-12-04T10:03:53.7150352Z 2025-12-04T10:03:53.7150492Z Note:: 2025-12-04T10:03:53.7150794Z The sparsification mask is computed on the input **before it goes through the attached layer**. 2025-12-04T10:03:53.7151139Z 2025-12-04T10:03:53.7151264Z Args: 2025-12-04T10:03:53.7151411Z model (nn.Module): 2025-12-04T10:03:53.7151677Z The model whose layers will be sparsified. The layers that needs to be 2025-12-04T10:03:53.7152076Z sparsified should be added separately using the register_layer() function 2025-12-04T10:03:53.7152391Z aggregate_fn (Optional, Callable): 2025-12-04T10:03:53.7152715Z default aggregate_fn that is used if not specified while registering the layer. 2025-12-04T10:03:53.7153084Z specifies how inputs should be aggregated over time. 2025-12-04T10:03:53.7153469Z The aggregate_fn should usually take 2 torch tensors and return the aggregated tensor. 
2025-12-04T10:03:53.7153801Z Example 2025-12-04T10:03:53.7154023Z def add_agg_fn(tensor1, tensor2): return tensor1 + tensor2 2025-12-04T10:03:53.7154296Z reduce_fn (Optional, Callable): 2025-12-04T10:03:53.7154608Z default reduce_fn that is used if not specified while registering the layer. 2025-12-04T10:03:53.7155022Z reduce_fn will be called on the aggregated tensor i.e. the tensor obtained after 2025-12-04T10:03:53.7155617Z calling agg_fn() on all inputs. 2025-12-04T10:03:53.7155840Z Example 2025-12-04T10:03:53.7156092Z def mean_reduce_fn(agg_tensor): return agg_tensor.mean(dim=0) 2025-12-04T10:03:53.7156489Z mask_fn (Optional, Callable): 2025-12-04T10:03:53.7156870Z default mask_fn that is used to create the sparsification mask using the tensor obtained after 2025-12-04T10:03:53.7157328Z calling the reduce_fn(). This is used by default if a custom one is passed in the 2025-12-04T10:03:53.7157648Z register_layer(). 2025-12-04T10:03:53.7158018Z Note that the mask_fn() definition should contain the sparse arguments that is passed in sparse_config 2025-12-04T10:03:53.7158387Z arguments. 2025-12-04T10:03:53.7158583Z features (Optional, list): 2025-12-04T10:03:53.7158818Z default selected features to sparsify. 2025-12-04T10:03:53.7159230Z If this is non-empty, then the mask_fn will be applied for each feature of the input. 2025-12-04T10:03:53.7159555Z For example, 2025-12-04T10:03:53.7159857Z mask = [mask_fn(reduce_fn(aggregated_fn(input[feature])) for feature in features] 2025-12-04T10:03:53.7160258Z feature_dim (Optional, int): 2025-12-04T10:03:53.7160592Z default dimension of input features. Again, features along this dim will be chosen 2025-12-04T10:03:53.7160932Z for sparsification. 2025-12-04T10:03:53.7161144Z sparse_config (Dict): 2025-12-04T10:03:53.7161428Z Default configuration for the mask_fn. 
This config will be passed 2025-12-04T10:03:53.7161728Z with the mask_fn() 2025-12-04T10:03:53.7161918Z 2025-12-04T10:03:53.7162051Z Example: 2025-12-04T10:03:53.7162200Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7162388Z >>> model = SomeModel() 2025-12-04T10:03:53.7162685Z >>> act_sparsifier = ActivationSparsifier(...) # init activation sparsifier 2025-12-04T10:03:53.7163002Z >>> # Initialize aggregate_fn 2025-12-04T10:03:53.7163202Z >>> def agg_fn(x, y): 2025-12-04T10:03:53.7163387Z >>> return x + y 2025-12-04T10:03:53.7163624Z >>> 2025-12-04T10:03:53.7163777Z >>> # Initialize reduce_fn 2025-12-04T10:03:53.7163976Z >>> def reduce_fn(x): 2025-12-04T10:03:53.7164172Z >>> return torch.mean(x, dim=0) 2025-12-04T10:03:53.7164376Z >>> 2025-12-04T10:03:53.7164520Z >>> # Initialize mask_fn 2025-12-04T10:03:53.7164704Z >>> def mask_fn(data): 2025-12-04T10:03:53.7164925Z >>> return torch.eye(data.shape).to(data.device) 2025-12-04T10:03:53.7165153Z >>> 2025-12-04T10:03:53.7165284Z >>> 2025-12-04T10:03:53.7165437Z >>> act_sparsifier.register_layer( 2025-12-04T10:03:53.7165654Z ... model.some_layer, 2025-12-04T10:03:53.7165850Z ... aggregate_fn=agg_fn, 2025-12-04T10:03:53.7166048Z ... reduce_fn=reduce_fn, 2025-12-04T10:03:53.7166239Z ... mask_fn=mask_fn, 2025-12-04T10:03:53.7166420Z ... 
) 2025-12-04T10:03:53.7166556Z >>> 2025-12-04T10:03:53.7166705Z >>> # start training process 2025-12-04T10:03:53.7166905Z >>> for _ in [...]: 2025-12-04T10:03:53.7167078Z >>> # epoch starts 2025-12-04T10:03:53.7167310Z >>> # model.forward(), compute_loss() and model.backwards() 2025-12-04T10:03:53.7167571Z >>> # epoch ends 2025-12-04T10:03:53.7167749Z >>> act_sparsifier.step() 2025-12-04T10:03:53.7167956Z >>> # end training process 2025-12-04T10:03:53.7168158Z >>> sparsifier.squash_mask() 2025-12-04T10:03:53.7168342Z 2025-12-04T10:03:53.7168752Z Original Error: IndentationError("expected an indented block after 'for' statement on line 25", ('', 26, 1, '_._ = None\n', 26, 2)) 2025-12-04T10:03:53.7169211Z 2025-12-04T10:03:53.7169338Z _._ = None 2025-12-04T10:03:53.7169478Z ^ 2025-12-04T10:03:53.7169621Z warnings.warn(msg) 2025-12-04T10:03:53.7169785Z 2025-12-04T10:03:53.7170006Z --- Parse Warning: 12 / 17 --- 2025-12-04T10:03:53.7170864Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=DeviceMesh.__getitem__ in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/device_mesh.py line=547. 2025-12-04T10:03:53.7171737Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7172035Z 2025-12-04T10:03:53.7172283Z Slice the current DeviceMesh based on the mesh_dim_names given to create a submesh. 2025-12-04T10:03:53.7172727Z The submesh created consists of the dimensions and the communicators indicated by 2025-12-04T10:03:53.7173057Z ``mesh_dim_names`` 2025-12-04T10:03:53.7173208Z 2025-12-04T10:03:53.7173334Z Args: 2025-12-04T10:03:53.7173615Z mesh_dim_names (Union[str, Tuple[str]]): the name or the tuple of names of the 2025-12-04T10:03:53.7173976Z mesh dimension of the DeviceMesh to create the submesh for. 
2025-12-04T10:03:53.7174232Z Returns: 2025-12-04T10:03:53.7174391Z A :class:`DeviceMesh` object 2025-12-04T10:03:53.7174621Z 2025-12-04T10:03:53.7174878Z The following program runs on each process/rank in an SPMD manner in a world size of 8. 2025-12-04T10:03:53.7175213Z In the first example: 2025-12-04T10:03:53.7175499Z Calling mesh_2d["tp"] on rank 0, 1, 2, 3 returns a 1D submesh of DeviceMesh:([0, 1, 2, 3]). 2025-12-04T10:03:53.7175905Z Calling mesh_2d["tp"] on rank 4, 5, 6, 7 returns a 1D submesh of DeviceMesh:([4, 5, 6, 7]). 2025-12-04T10:03:53.7176299Z Calling mesh_2d["dp"] on rank 0, 4 returns a 1D submesh of DeviceMesh:([0, 4]). 2025-12-04T10:03:53.7176673Z Calling mesh_2d["dp"] on rank 1, 5 returns a 1D submesh of DeviceMesh:([1, 5]). 2025-12-04T10:03:53.7177042Z Calling mesh_2d["dp"] on rank 2, 6 returns a 1D submesh of DeviceMesh:([2, 6]). 2025-12-04T10:03:53.7177400Z Calling mesh_2d["dp"] on rank 3, 7 returns a 1D submesh of DeviceMesh:([3, 7]). 2025-12-04T10:03:53.7177677Z 2025-12-04T10:03:53.7177821Z In the second example: 2025-12-04T10:03:53.7178169Z Calling mesh_3d["dp", "cp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 1], [4, 5]]). 2025-12-04T10:03:53.7178601Z Calling mesh_3d["dp", "cp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 3], [6, 7]]). 2025-12-04T10:03:53.7179027Z Calling mesh_3d["cp", "dp"] on rank 0, 1, 4, 5 returns a 2D submesh of DeviceMesh:([[0, 4], [1, 5]]). 2025-12-04T10:03:53.7179448Z Calling mesh_3d["cp", "dp"] on rank 2, 3, 6, 7 returns a 2D submesh of DeviceMesh:([[2, 6], [3, 7]]). 
2025-12-04T10:03:53.7179749Z 2025-12-04T10:03:53.7179892Z Example:: 2025-12-04T10:03:53.7180041Z 2025-12-04T10:03:53.7180249Z >>> # xdoctest: +SKIP("no rank") 2025-12-04T10:03:53.7180704Z >>> from torch.distributed.device_mesh import DeviceMesh 2025-12-04T10:03:53.7181125Z >>> 2025-12-04T10:03:53.7181485Z >>> # Initialize a 2D device mesh as (2, 4) to represent the topology 2025-12-04T10:03:53.7181870Z >>> # of cross-host(dim 0), and within-host (dim 1). 2025-12-04T10:03:53.7182221Z >>> mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T10:03:53.7182543Z >>> tp_mesh = mesh_2d["tp"] 2025-12-04T10:03:53.7182746Z >>> dp_mesh = mesh_2d["dp"] 2025-12-04T10:03:53.7182927Z >>> 2025-12-04T10:03:53.7183077Z >>> # Initialize a 3D mesh. 2025-12-04T10:03:53.7183387Z >>> mesh_3d = init_device_mesh(device_type="cuda", (2,2,2), mesh_dim_names=("dp", "pp", "cp")) 2025-12-04T10:03:53.7183855Z >>> # The order of the mesh_dim_names provided deteremines the order of dimensions in the submesh. 2025-12-04T10:03:53.7184212Z >>> dp_cp_mesh = mesh_3d["dp", "cp"] 2025-12-04T10:03:53.7184433Z >>> cp_dp_mesh = mesh_3d["cp", "dp"] 2025-12-04T10:03:53.7184631Z 2025-12-04T10:03:53.7185238Z Original Error: SyntaxError('positional argument follows keyword argument', ('', 6, 82, 'mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp"))\n', 6, 83)) 2025-12-04T10:03:53.7185828Z 2025-12-04T10:03:53.7186062Z mesh_2d = init_device_mesh(device_type="cuda", (2,4), mesh_dim_names=("dp", "tp")) 2025-12-04T10:03:53.7186390Z ^ 2025-12-04T10:03:53.7186616Z warnings.warn(msg) 2025-12-04T10:03:53.7186778Z 2025-12-04T10:03:53.7186997Z --- Parse Warning: 13 / 17 --- 2025-12-04T10:03:53.7187890Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=SavePlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=122. 
2025-12-04T10:03:53.7188817Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7189114Z 2025-12-04T10:03:53.7189384Z Abstract class defining the protocol used by save_state_dict to plan the save process. 2025-12-04T10:03:53.7189773Z 2025-12-04T10:03:53.7190039Z SavePlanners are stateful objects that can be used to customize the whole save process. 2025-12-04T10:03:53.7190372Z 2025-12-04T10:03:53.7190631Z SavePlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T10:03:53.7190973Z will be visible to the whole process. 2025-12-04T10:03:53.7191174Z 2025-12-04T10:03:53.7191453Z A planner subclass can expect the following sequence of calls during save_state_dict: 2025-12-04T10:03:53.7191778Z 2025-12-04T10:03:53.7191930Z 1) set_up_planner - called on all ranks. 2025-12-04T10:03:53.7192173Z Signals the start of a checkpoint save. 2025-12-04T10:03:53.7192391Z 2025-12-04T10:03:53.7192575Z 2) create_local_plan - called on all ranks. 2025-12-04T10:03:53.7192989Z Process the state_dict and produces a `SavePlan` that will be sent for global planning. 2025-12-04T10:03:53.7193381Z 2025-12-04T10:03:53.7193621Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T10:03:53.7194087Z Takes the SavePlan from all ranks and make any global decision. 2025-12-04T10:03:53.7194404Z 2025-12-04T10:03:53.7194582Z 4) finish_plan - called on all ranks. 2025-12-04T10:03:53.7194889Z This gives each rank a chance to adjust to global planning decisions. 2025-12-04T10:03:53.7195173Z 2025-12-04T10:03:53.7195354Z 5) resolve_data - called multiple times on each rank 2025-12-04T10:03:53.7195667Z Lookups a value on the `state_dict` for the storage layer to write. 2025-12-04T10:03:53.7195936Z 2025-12-04T10:03:53.7196218Z Users are recommended to extend DefaultSavePlanner instead of this interface directly as 2025-12-04T10:03:53.7196627Z most changes can be expressed by changes in a single method. 
2025-12-04T10:03:53.7196885Z 2025-12-04T10:03:53.7197044Z There are 3 usual patterns of extension: 2025-12-04T10:03:53.7197250Z 2025-12-04T10:03:53.7197501Z Rewriting state_dict. This is the simplest way to extend the save process as it 2025-12-04T10:03:53.7197908Z doesn't requite understanding the intrincacies of how SavePlan works: 2025-12-04T10:03:53.7198189Z 2025-12-04T10:03:53.7198337Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7198578Z >>> class RenamePlanner(DefaultSavePlanner): 2025-12-04T10:03:53.7198810Z >>> def set_up_planner( 2025-12-04T10:03:53.7198988Z >>> self, 2025-12-04T10:03:53.7199178Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T10:03:53.7199412Z >>> storage_meta: Optional[StorageMeta], 2025-12-04T10:03:53.7199643Z >>> is_coordinator: bool, 2025-12-04T10:03:53.7199841Z >>> ) -> None: 2025-12-04T10:03:53.7200021Z >>> # prefix all keys with `foo_`` 2025-12-04T10:03:53.7200356Z >>> super().set_up_planner({"foo_" + k: v for k, v in state_dict.items()}, storage_meta, is_coordinator) 2025-12-04T10:03:53.7200740Z 2025-12-04T10:03:53.7201047Z Modifying local plan and lookup in tandem. 
This is useful when fine control of how data is persisted 2025-12-04T10:03:53.7201406Z 2025-12-04T10:03:53.7201553Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7201793Z >>> class FP16Planner(DefaultSavePlanner): 2025-12-04T10:03:53.7202026Z >>> def create_local_plan(self): 2025-12-04T10:03:53.7202253Z >>> plan = super().create_local_plan() 2025-12-04T10:03:53.7202478Z >>> for p in plan: 2025-12-04T10:03:53.7202683Z >>> if p.tensor_data is not None: 2025-12-04T10:03:53.7202948Z >>> p.tensor_data.properties.dtype = torch.float16 2025-12-04T10:03:53.7203203Z >>> return plan 2025-12-04T10:03:53.7203380Z >>> 2025-12-04T10:03:53.7203598Z >>> def resolve_data(self, write_item): 2025-12-04T10:03:53.7203840Z >>> item = super().resolve_data(write_item) 2025-12-04T10:03:53.7204191Z >>> return item if write_item.type == WriteItemType.BYTE_IO else item.to(torch.float16) 2025-12-04T10:03:53.7204549Z 2025-12-04T10:03:53.7204850Z Using the global planning step to make central decisions that can't be made individually by each rank 2025-12-04T10:03:53.7205213Z 2025-12-04T10:03:53.7205366Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7205586Z >>> from itertools import zip_longest 2025-12-04T10:03:53.7205810Z >>> from dataclasses import replace 2025-12-04T10:03:53.7206074Z >>> class DDPLoadBalancingPlanner(DefaultSavePlanner): 2025-12-04T10:03:53.7206451Z >>> # This uses the default local plan behavior of having all non-sharded writes in rank 0 2025-12-04T10:03:53.7206799Z >>> # This sample doesn't handle ShardedTensors 2025-12-04T10:03:53.7207060Z >>> def create_global_plan(self, all_plans): 2025-12-04T10:03:53.7207333Z >>> iters = [iter(all_plans[0].items)] * len(all_plans) 2025-12-04T10:03:53.7207580Z >>> items_per_rank = [ 2025-12-04T10:03:53.7207855Z >>> [item for item in items if item is not None] 2025-12-04T10:03:53.7208131Z >>> for items in zip(*zip_longest(*iters), strict=True) 2025-12-04T10:03:53.7208374Z >>> ] 2025-12-04T10:03:53.7208545Z >>> 
all_plans = [ 2025-12-04T10:03:53.7208752Z >>> replace(plan, items=items) 2025-12-04T10:03:53.7209036Z >>> for plan, items in zip(all_plans, items_per_rank, strict=True) 2025-12-04T10:03:53.7209302Z >>> ] 2025-12-04T10:03:53.7209489Z >>> return super().create_global_plan(all_plans) 2025-12-04T10:03:53.7209713Z 2025-12-04T10:03:53.7209959Z Finally, some planners need to save additional metadata in the checkpoint, this is 2025-12-04T10:03:53.7210397Z accomplished by having each rank contribute their data items in the local plan and 2025-12-04T10:03:53.7210735Z the global planner aggregate them: 2025-12-04T10:03:53.7210927Z 2025-12-04T10:03:53.7211096Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7211358Z >>> class SaveExtraDataPlanner(DefaultSavePlanner): 2025-12-04T10:03:53.7211625Z >>> def create_local_plan(self) -> SavePlan: 2025-12-04T10:03:53.7211862Z >>> plan = super().create_local_plan() 2025-12-04T10:03:53.7212133Z >>> return replace(plan, planner_data="per-rank-data") 2025-12-04T10:03:53.7212376Z >>> 2025-12-04T10:03:53.7212647Z >>> def create_global_plan(self, all_plans: List[SavePlan]) -> Tuple[List[SavePlan], Metadata]: 2025-12-04T10:03:53.7213068Z >>> global_plan, metadata = super().create_global_plan(all_plans) 2025-12-04T10:03:53.7213387Z >>> merged_data = [p.planner_data for p in global_plan] 2025-12-04T10:03:53.7213677Z >>> metadata = replace(metadata, planner_data=merged_data) 2025-12-04T10:03:53.7213942Z >>> return global_plan, metadata 2025-12-04T10:03:53.7214143Z 2025-12-04T10:03:53.7214626Z Original Error: IndentationError('expected an indented block after function definition on line 3', ('', 9, 0, '_._ = None\n', 9, -1)) 2025-12-04T10:03:53.7215111Z 2025-12-04T10:03:53.7215242Z _._ = None 2025-12-04T10:03:53.7215390Z ^ 2025-12-04T10:03:53.7215543Z warnings.warn(msg) 2025-12-04T10:03:53.7215712Z 2025-12-04T10:03:53.7215918Z --- Parse Warning: 14 / 17 --- 2025-12-04T10:03:53.7216717Z 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=LoadPlanner in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/checkpoint/planner.py line=305. 2025-12-04T10:03:53.7217590Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7217939Z 2025-12-04T10:03:53.7218209Z Abstract class defining the protocol used by load_state_dict to plan the load process. 2025-12-04T10:03:53.7218533Z 2025-12-04T10:03:53.7218802Z LoadPlanner are stateful objects that can be used to customize the whole load process. 2025-12-04T10:03:53.7219185Z 2025-12-04T10:03:53.7219440Z LoadPlanner acts as an access proxy to the state_dict, so any transformation done to it 2025-12-04T10:03:53.7219781Z will be visible to the whole process. 2025-12-04T10:03:53.7219986Z 2025-12-04T10:03:53.7220243Z A planner subclass can expect the following sequence of calls during load_state_dict: 2025-12-04T10:03:53.7220562Z 2025-12-04T10:03:53.7220718Z 1) set_up_planner - called on all ranks. 2025-12-04T10:03:53.7220962Z Signals the start of loading a checkpoint. 2025-12-04T10:03:53.7221174Z 2025-12-04T10:03:53.7221338Z 2) create_local_plan - called on all ranks. 2025-12-04T10:03:53.7221694Z Process the state_dict and produces a `LoadPlan` that will be sent for global planning. 2025-12-04T10:03:53.7222017Z 2025-12-04T10:03:53.7222228Z 3) create_global_plan - called on the coordinator rank only. 2025-12-04T10:03:53.7222573Z Takes the LoadPlan from all ranks and make any global decision. 2025-12-04T10:03:53.7222888Z 2025-12-04T10:03:53.7223072Z 4) load_bytes - called multiple times on each rank 2025-12-04T10:03:53.7223358Z This is called once per non-tensor value in state_dict. 
2025-12-04T10:03:53.7223414Z 2025-12-04T10:03:53.7223568Z 5) resolve_tensor and commit_tensor - called multiple times on each rank 2025-12-04T10:03:53.7223690Z They are called in pair for each Tensor value in state_dict. 2025-12-04T10:03:53.7223743Z 2025-12-04T10:03:53.7223952Z Users are recommended to extend DefaultLoadPlanner instead of this interface directly as 2025-12-04T10:03:53.7224074Z most changes can be expressed by changes in a single method. 2025-12-04T10:03:53.7224125Z 2025-12-04T10:03:53.7224219Z There are two usual patterns of extension: 2025-12-04T10:03:53.7224272Z 2025-12-04T10:03:53.7224448Z Rewriting state_dict. This is the simplest way to extend the load process as it 2025-12-04T10:03:53.7224629Z doesn't requite understanding the intrincacies of how LoadPlan works. We need 2025-12-04T10:03:53.7224777Z to keep a reference to the original state_dict as load happens in place so 2025-12-04T10:03:53.7224877Z we need to be able to perform it in place 2025-12-04T10:03:53.7224930Z 2025-12-04T10:03:53.7225011Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7225107Z >>> class RenamePlanner(DefaultLoadPlanner): 2025-12-04T10:03:53.7225175Z >>> def set_up_planner( 2025-12-04T10:03:53.7225235Z >>> self, 2025-12-04T10:03:53.7225315Z >>> state_dict: STATE_DICT_TYPE, 2025-12-04T10:03:53.7225383Z >>> metadata: Metadata, 2025-12-04T10:03:53.7225456Z >>> is_coordinator: bool, 2025-12-04T10:03:53.7225521Z >>> ) -> None: 2025-12-04T10:03:53.7225607Z >>> self.original_state_dict = state_dict 2025-12-04T10:03:53.7225784Z >>> state_dict = {"foo_" + k: v for k, v in state_dict.items()} 2025-12-04T10:03:53.7225845Z >>> 2025-12-04T10:03:53.7225926Z >>> if self.flatten_sharded_tensors: 2025-12-04T10:03:53.7226037Z >>> state_dict = _flatten_sharded_tensors(state_dict) 2025-12-04T10:03:53.7226091Z >>> 2025-12-04T10:03:53.7226165Z >>> if self.flatten_state_dict: 2025-12-04T10:03:53.7226293Z >>> state_dict, self.mappings = flatten_state_dict(state_dict) 
2025-12-04T10:03:53.7226346Z >>> 2025-12-04T10:03:53.7226420Z >>> self.state_dict = state_dict 2025-12-04T10:03:53.7226511Z >>> self.metadata = metadata 2025-12-04T10:03:53.7226599Z >>> self.is_coordinator = is_coordinator 2025-12-04T10:03:53.7226658Z >>> 2025-12-04T10:03:53.7226785Z >>> def load_bytes(self, read_item, value): 2025-12-04T10:03:53.7226859Z >>> # Remove the "foo_" prefix 2025-12-04T10:03:53.7227082Z >>> self.original_state_dict[read_item.dest_index.fqn[4:]] = torch.load(value, weights_only=False) 2025-12-04T10:03:53.7227172Z 2025-12-04T10:03:53.7227279Z 2025-12-04T10:03:53.7227460Z Modifying resolve_tensor and commit_tensor to handle load time transformation. 2025-12-04T10:03:53.7227512Z 2025-12-04T10:03:53.7227589Z >>> # xdoctest: +SKIP("undefined vars") 2025-12-04T10:03:53.7227705Z >>> class MetaModelMaterialize(DefaultSavePlanner): 2025-12-04T10:03:53.7227787Z >>> def resolve_tensor(self, read_item): 2025-12-04T10:03:53.7227884Z >>> tensor = super().resolve_tensor(read_item) 2025-12-04T10:03:53.7227986Z >>> return torch.empty_like(tensor, device="cpu") 2025-12-04T10:03:53.7228039Z >>> 2025-12-04T10:03:53.7228132Z >>> def commit_tensor(self, read_item, tensor): 2025-12-04T10:03:53.7228242Z >>> self.state_dict[read_item.dest_index.fqn] = tensor 2025-12-04T10:03:53.7228295Z 2025-12-04T10:03:53.7228667Z Original Error: IndentationError('expected an indented block after function definition on line 22', ('', 23, 0, '_._ = None\n', 23, -1)) 2025-12-04T10:03:53.7228766Z 2025-12-04T10:03:53.7228823Z _._ = None 2025-12-04T10:03:53.7228882Z ^ 2025-12-04T10:03:53.7228950Z warnings.warn(msg) 2025-12-04T10:03:53.7229003Z 2025-12-04T10:03:53.7229154Z --- Parse Warning: 15 / 17 --- 2025-12-04T10:03:53.7229790Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=FullStateDictConfig in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/distributed/fsdp/api.py line=295. 
2025-12-04T10:03:53.7229960Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7230013Z 2025-12-04T10:03:53.7230153Z ``FullStateDictConfig`` is a config class meant to be used with 2025-12-04T10:03:53.7230291Z ``StateDictType.FULL_STATE_DICT``. We recommend enabling both 2025-12-04T10:03:53.7230429Z ``offload_to_cpu=True`` and ``rank0_only=True`` when saving full state 2025-12-04T10:03:53.7230588Z dicts to save GPU memory and CPU memory, respectively. This config class 2025-12-04T10:03:53.7230722Z is meant to be used via the :func:`state_dict_type` context manager as 2025-12-04T10:03:53.7230781Z follows: 2025-12-04T10:03:53.7230846Z 2025-12-04T10:03:53.7230933Z >>> # xdoctest: +SKIP("undefined variables") 2025-12-04T10:03:53.7231094Z >>> from torch.distributed.fsdp import FullyShardedDataParallel as FSDP 2025-12-04T10:03:53.7231187Z >>> fsdp = FSDP(model, auto_wrap_policy=...) 2025-12-04T10:03:53.7231326Z >>> cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True) 2025-12-04T10:03:53.7231473Z >>> with FSDP.state_dict_type(fsdp, StateDictType.FULL_STATE_DICT, cfg): 2025-12-04T10:03:53.7231555Z >>> state = fsdp.state_dict() 2025-12-04T10:03:53.7231701Z >>> # `state` will be empty on non rank 0 and contain CPU tensors on rank 0. 2025-12-04T10:03:53.7231921Z >>> # To reload checkpoint for inference, finetuning, transfer learning, etc: 2025-12-04T10:03:53.7232089Z >>> model = model_fn() # Initialize model in preparation for wrapping with FSDP 2025-12-04T10:03:53.7232160Z >>> if dist.get_rank() == 0: 2025-12-04T10:03:53.7232296Z >>> # Load checkpoint only on rank 0 to avoid memory redundancy 2025-12-04T10:03:53.7232392Z >>> state_dict = torch.load("my_checkpoint.pt") 2025-12-04T10:03:53.7232478Z >>> model.load_state_dict(state_dict) 2025-12-04T10:03:53.7232640Z >>> # All ranks initialize FSDP module as usual. 
`sync_module_states` argument 2025-12-04T10:03:53.7232804Z >>> # communicates loaded checkpoint states from rank 0 to rest of the world. 2025-12-04T10:03:53.7232874Z >>> fsdp = FSDP( 2025-12-04T10:03:53.7232993Z ... model, 2025-12-04T10:03:53.7233086Z ... device_id=torch.cuda.current_device(), 2025-12-04T10:03:53.7233175Z ... auto_wrap_policy=..., 2025-12-04T10:03:53.7233295Z ... sync_module_states=True, 2025-12-04T10:03:53.7233353Z ... ) 2025-12-04T10:03:53.7233506Z >>> # After this point, all ranks have FSDP model with loaded checkpoint. 2025-12-04T10:03:53.7233559Z 2025-12-04T10:03:53.7233616Z Attributes: 2025-12-04T10:03:53.7233757Z rank0_only (bool): If ``True``, then only rank 0 saves the full state 2025-12-04T10:03:53.7233892Z dict, and nonzero ranks save an empty dict. If ``False``, then all 2025-12-04T10:03:53.7234005Z ranks save the full state dict. (Default: ``False``) 2025-12-04T10:03:53.7234056Z 2025-12-04T10:03:53.7234392Z Original Error: IndentationError("expected an indented block after 'if' statement on line 10", ('', 11, 1, '_._ = None\n', 11, 2)) 2025-12-04T10:03:53.7234452Z 2025-12-04T10:03:53.7234507Z _._ = None 2025-12-04T10:03:53.7234561Z ^ 2025-12-04T10:03:53.7234632Z warnings.warn(msg) 2025-12-04T10:03:53.7234687Z 2025-12-04T10:03:53.7234819Z --- Parse Warning: 16 / 17 --- 2025-12-04T10:03:53.7235511Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=register_parametrization in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/parametrize.py line=437. 2025-12-04T10:03:53.7235676Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7235785Z Register a parametrization to a tensor in a module. 2025-12-04T10:03:53.7235837Z 2025-12-04T10:03:53.7236021Z Assume that ``tensor_name="weight"`` for simplicity. 
When accessing ``module.weight``, 2025-12-04T10:03:53.7236217Z the module will return the parametrized version ``parametrization(module.weight)``. 2025-12-04T10:03:53.7236392Z If the original tensor requires a gradient, the backward pass will differentiate 2025-12-04T10:03:53.7236591Z through :attr:`parametrization`, and the optimizer will update the tensor accordingly. 2025-12-04T10:03:53.7236648Z 2025-12-04T10:03:53.7236871Z The first time that a module registers a parametrization, this function will add an attribute 2025-12-04T10:03:53.7237043Z ``parametrizations`` to the module of type :class:`~ParametrizationList`. 2025-12-04T10:03:53.7237095Z 2025-12-04T10:03:53.7237265Z The list of parametrizations on the tensor ``weight`` will be accessible under 2025-12-04T10:03:53.7237364Z ``module.parametrizations.weight``. 2025-12-04T10:03:53.7237417Z 2025-12-04T10:03:53.7237508Z The original tensor will be accessible under 2025-12-04T10:03:53.7237618Z ``module.parametrizations.weight.original``. 2025-12-04T10:03:53.7237671Z 2025-12-04T10:03:53.7237857Z Parametrizations may be concatenated by registering several parametrizations 2025-12-04T10:03:53.7237928Z on the same attribute. 2025-12-04T10:03:53.7237981Z 2025-12-04T10:03:53.7238196Z The training mode of a registered parametrization is updated on registration 2025-12-04T10:03:53.7238296Z to match the training mode of the host module 2025-12-04T10:03:53.7238349Z 2025-12-04T10:03:53.7238559Z Parametrized parameters and buffers have an inbuilt caching system that can be activated 2025-12-04T10:03:53.7238645Z using the context manager :func:`cached`. 2025-12-04T10:03:53.7238697Z 2025-12-04T10:03:53.7238864Z A :attr:`parametrization` may optionally implement a method with signature 2025-12-04T10:03:53.7238918Z 2025-12-04T10:03:53.7239008Z .. 
code-block:: python 2025-12-04T10:03:53.7239061Z 2025-12-04T10:03:53.7239209Z def right_inverse(self, X: Tensor) -> Union[Tensor, Sequence[Tensor]] 2025-12-04T10:03:53.7239267Z 2025-12-04T10:03:53.7239488Z This method is called on the unparametrized tensor when the first parametrization 2025-12-04T10:03:53.7239634Z is registered to compute the initial value of the original tensor. 2025-12-04T10:03:53.7239879Z If this method is not implemented, the original tensor will be just the unparametrized tensor. 2025-12-04T10:03:53.7239936Z 2025-12-04T10:03:53.7240145Z If all the parametrizations registered on a tensor implement `right_inverse` it is possible 2025-12-04T10:03:53.7240350Z to initialize a parametrized tensor by assigning to it, as shown in the example below. 2025-12-04T10:03:53.7240403Z 2025-12-04T10:03:53.7240559Z It is possible for the first parametrization to depend on several inputs. 2025-12-04T10:03:53.7240722Z This may be implemented returning a tuple of tensors from ``right_inverse`` 2025-12-04T10:03:53.7240878Z (see the example implementation of a ``RankOne`` parametrization below). 2025-12-04T10:03:53.7240938Z 2025-12-04T10:03:53.7241172Z In this case, the unconstrained tensors are also located under ``module.parametrizations.weight`` 2025-12-04T10:03:53.7241263Z with names ``original0``, ``original1``,... 2025-12-04T10:03:53.7241362Z 2025-12-04T10:03:53.7241421Z .. note:: 2025-12-04T10:03:53.7241475Z 2025-12-04T10:03:53.7241661Z If unsafe=False (default) both the forward and right_inverse methods will be called 2025-12-04T10:03:53.7241763Z once to perform a number of consistency checks. 2025-12-04T10:03:53.7241951Z If unsafe=True, then right_inverse will be called if the tensor is not parametrized, 2025-12-04T10:03:53.7242032Z and nothing will be called otherwise. 2025-12-04T10:03:53.7242086Z 2025-12-04T10:03:53.7242149Z .. 
note:: 2025-12-04T10:03:53.7242207Z 2025-12-04T10:03:53.7242344Z In most situations, ``right_inverse`` will be a function such that 2025-12-04T10:03:53.7242437Z ``forward(right_inverse(X)) == X`` (see 2025-12-04T10:03:53.7242640Z `right inverse `_). 2025-12-04T10:03:53.7242823Z Sometimes, when the parametrization is not surjective, it may be reasonable 2025-12-04T10:03:53.7242892Z to relax this. 2025-12-04T10:03:53.7242958Z 2025-12-04T10:03:53.7243025Z .. warning:: 2025-12-04T10:03:53.7243084Z 2025-12-04T10:03:53.7243271Z If a parametrization depends on several inputs, :func:`~register_parametrization` 2025-12-04T10:03:53.7243455Z will register a number of new parameters. If such parametrization is registered 2025-12-04T10:03:53.7243639Z after the optimizer is created, these new parameters will need to be added manually 2025-12-04T10:03:53.7243779Z to the optimizer. See :meth:`torch.Optimizer.add_param_group`. 2025-12-04T10:03:53.7243836Z 2025-12-04T10:03:53.7243895Z Args: 2025-12-04T10:03:53.7244052Z module (nn.Module): module on which to register the parametrization 2025-12-04T10:03:53.7244198Z tensor_name (str): name of the parameter or buffer on which to register 2025-12-04T10:03:53.7244320Z the parametrization 2025-12-04T10:03:53.7244469Z parametrization (nn.Module): the parametrization to register 2025-12-04T10:03:53.7244531Z Keyword args: 2025-12-04T10:03:53.7244677Z unsafe (bool): a boolean flag that denotes whether the parametrization 2025-12-04T10:03:53.7244815Z may change the dtype and shape of the tensor. Default: `False` 2025-12-04T10:03:53.7244993Z Warning: the parametrization is not checked for consistency upon registration. 2025-12-04T10:03:53.7245080Z Enable this flag at your own risk. 
2025-12-04T10:03:53.7245134Z 2025-12-04T10:03:53.7245191Z Raises: 2025-12-04T10:03:53.7245433Z ValueError: if the module does not have a parameter or a buffer named :attr:`tensor_name` 2025-12-04T10:03:53.7245498Z 2025-12-04T10:03:53.7245558Z Examples: 2025-12-04T10:03:53.7245669Z >>> # xdoctest: +REQUIRES(env:TORCH_DOCTEST_LAPACK) 2025-12-04T10:03:53.7245773Z >>> import torch 2025-12-04T10:03:53.7245846Z >>> import torch.nn as nn 2025-12-04T10:03:53.7245948Z >>> import torch.nn.utils.parametrize as P 2025-12-04T10:03:53.7246003Z >>> 2025-12-04T10:03:53.7246082Z >>> class Symmetric(nn.Module): 2025-12-04T10:03:53.7246153Z >>> def forward(self, X): 2025-12-04T10:03:53.7246275Z >>> return X.triu() + X.triu(1).T # Return a symmetric matrix 2025-12-04T10:03:53.7246333Z >>> 2025-12-04T10:03:53.7246411Z >>> def right_inverse(self, A): 2025-12-04T10:03:53.7246479Z >>> return A.triu() 2025-12-04T10:03:53.7246539Z >>> 2025-12-04T10:03:53.7246606Z >>> m = nn.Linear(5, 5) 2025-12-04T10:03:53.7246727Z >>> P.register_parametrization(m, "weight", Symmetric()) 2025-12-04T10:03:53.7246904Z >>> print(torch.allclose(m.weight, m.weight.T)) # m.weight is now symmetric 2025-12-04T10:03:53.7246961Z True 2025-12-04T10:03:53.7247072Z >>> A = torch.rand(5, 5) 2025-12-04T10:03:53.7247144Z >>> A = A + A.T # A is now symmetric 2025-12-04T10:03:53.7247288Z >>> m.weight = A # Initialize the weight to be the symmetric matrix A 2025-12-04T10:03:53.7247378Z >>> print(torch.allclose(m.weight, A)) 2025-12-04T10:03:53.7247433Z True 2025-12-04T10:03:53.7247486Z 2025-12-04T10:03:53.7247562Z >>> class RankOne(nn.Module): 2025-12-04T10:03:53.7247637Z >>> def forward(self, x, y): 2025-12-04T10:03:53.7247735Z >>> # Form a rank 1 matrix multiplying two vectors 2025-12-04T10:03:53.7247835Z >>> return x.unsqueeze(-1) @ y.unsqueeze(-2) 2025-12-04T10:03:53.7247891Z >>> 2025-12-04T10:03:53.7247974Z >>> def right_inverse(self, Z): 2025-12-04T10:03:53.7248055Z >>> # Project Z onto the rank 1 matrices 
2025-12-04T10:03:53.7248164Z >>> U, S, Vh = torch.linalg.svd(Z, full_matrices=False) 2025-12-04T10:03:53.7248253Z >>> # Return rescaled singular vectors 2025-12-04T10:03:53.7248337Z >>> s0_sqrt = S[0].sqrt().unsqueeze(-1) 2025-12-04T10:03:53.7248440Z >>> return U[..., :, 0] * s0_sqrt, Vh[..., 0, :] * s0_sqrt 2025-12-04T10:03:53.7248500Z >>> 2025-12-04T10:03:53.7248604Z >>> linear_rank_one = P.register_parametrization( 2025-12-04T10:03:53.7248688Z ... nn.Linear(4, 4), "weight", RankOne() 2025-12-04T10:03:53.7248749Z ... ) 2025-12-04T10:03:53.7248886Z >>> print(torch.linalg.matrix_rank(linear_rank_one.weight).item()) 2025-12-04T10:03:53.7248942Z 1 2025-12-04T10:03:53.7248995Z 2025-12-04T10:03:53.7249051Z 2025-12-04T10:03:53.7249456Z Original Error: IndentationError('expected an indented block after function definition on line 2', ('', 3, 0, '_._ = None\n', 3, -1)) 2025-12-04T10:03:53.7249514Z 2025-12-04T10:03:53.7249571Z _._ = None 2025-12-04T10:03:53.7249630Z ^ 2025-12-04T10:03:53.7249692Z warnings.warn(msg) 2025-12-04T10:03:53.7249746Z 2025-12-04T10:03:53.7249884Z --- Parse Warning: 17 / 17 --- 2025-12-04T10:03:53.7250510Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/xdoctest/core.py:416: UserWarning: Cannot scrape callname=ReduceLROnPlateau in modpath=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/optim/lr_scheduler.py line=1586. 2025-12-04T10:03:53.7250679Z Caused by: DoctestParseError('Failed to parse doctest in _package_groups') 2025-12-04T10:03:53.7250799Z Reduce learning rate when a metric has stopped improving. 2025-12-04T10:03:53.7250848Z 2025-12-04T10:03:53.7251029Z Models often benefit from reducing the learning rate by a factor 2025-12-04T10:03:53.7251157Z of 2-10 once learning stagnates. This scheduler reads a metrics 2025-12-04T10:03:53.7251290Z quantity and if no improvement is seen for a 'patience' number 2025-12-04T10:03:53.7251426Z of epochs, the learning rate is reduced. 
2025-12-04T10:03:53.7251479Z 2025-12-04T10:03:53.7251540Z Args: 2025-12-04T10:03:53.7251630Z optimizer (Optimizer): Wrapped optimizer. 2025-12-04T10:03:53.7251737Z mode (str): One of `min`, `max`. In `min` mode, lr will 2025-12-04T10:03:53.7251851Z be reduced when the quantity monitored has stopped 2025-12-04T10:03:53.7251963Z decreasing; in `max` mode it will be reduced when the 2025-12-04T10:03:53.7252090Z quantity monitored has stopped increasing. Default: 'min'. 2025-12-04T10:03:53.7252211Z factor (float): Factor by which the learning rate will be 2025-12-04T10:03:53.7252310Z reduced. new_lr = lr * factor. Default: 0.1. 2025-12-04T10:03:53.7252460Z patience (int): The number of allowed epochs with no improvement after 2025-12-04T10:03:53.7252546Z which the learning rate will be reduced. 2025-12-04T10:03:53.7252746Z For example, consider the case of having no patience (`patience = 0`). 2025-12-04T10:03:53.7252998Z In the first epoch, a baseline is established and is always considered good as there's no previous baseline. 2025-12-04T10:03:53.7253136Z In the second epoch, if the performance is worse than the baseline, 2025-12-04T10:03:53.7253238Z we have what is considered an intolerable epoch. 2025-12-04T10:03:53.7253426Z Since the count of intolerable epochs (1) is greater than the patience level (0), 2025-12-04T10:03:53.7253543Z the learning rate is reduced at the end of this epoch. 2025-12-04T10:03:53.7253768Z From the third epoch onwards, the learning rate continues to be reduced at the end of each epoch 2025-12-04T10:03:53.7253983Z if the performance is worse than the baseline. If the performance improves or remains the same, 2025-12-04T10:03:53.7254068Z the learning rate is not adjusted. 2025-12-04T10:03:53.7254137Z Default: 10. 2025-12-04T10:03:53.7254265Z threshold (float): Threshold for measuring the new optimum, 2025-12-04T10:03:53.7254371Z to only focus on significant changes. Default: 1e-4. 
2025-12-04T10:03:53.7254492Z threshold_mode (str): One of `rel`, `abs`. In `rel` mode, 2025-12-04T10:03:53.7254597Z dynamic_threshold = best * ( 1 + threshold ) in 'max' 2025-12-04T10:03:53.7254700Z mode or best * ( 1 - threshold ) in `min` mode. 2025-12-04T10:03:53.7254803Z In `abs` mode, dynamic_threshold = best + threshold in 2025-12-04T10:03:53.7254924Z `max` mode or best - threshold in `min` mode. Default: 'rel'. 2025-12-04T10:03:53.7255046Z cooldown (int): Number of epochs to wait before resuming 2025-12-04T10:03:53.7255407Z normal operation after lr has been reduced. Default: 0. 2025-12-04T10:03:53.7255541Z min_lr (float or list): A scalar or a list of scalars. A 2025-12-04T10:03:53.7255656Z lower bound on the learning rate of all param groups 2025-12-04T10:03:53.7255749Z or each group respectively. Default: 0. 2025-12-04T10:03:53.7255875Z eps (float): Minimal decay applied to lr. If the difference 2025-12-04T10:03:53.7255996Z between new and old lr is smaller than eps, the update is 2025-12-04T10:03:53.7256070Z ignored. Default: 1e-8. 2025-12-04T10:03:53.7256131Z 2025-12-04T10:03:53.7256191Z Example: 2025-12-04T10:03:53.7256260Z >>> # xdoctest: +SKIP 2025-12-04T10:03:53.7256506Z >>> optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9) 2025-12-04T10:03:53.7256619Z >>> scheduler = ReduceLROnPlateau(optimizer, "min") 2025-12-04T10:03:53.7256695Z >>> for epoch in range(10): 2025-12-04T10:03:53.7256763Z >>> train(...) 2025-12-04T10:03:53.7256899Z >>> val_loss = validate(...) 2025-12-04T10:03:53.7257010Z >>> # Note that step should be called after validate() 2025-12-04T10:03:53.7257085Z >>> scheduler.step(val_loss) 2025-12-04T10:03:53.7257138Z 2025-12-04T10:03:53.7257275Z .. 
image:: ../scripts/lr_scheduler_images/ReduceLROnPlateau.png 2025-12-04T10:03:53.7257329Z 2025-12-04T10:03:53.7257616Z Original Error: IndentationError('unexpected indent', ('', 8, 4, ' scheduler.step(val_loss)\n', 8, -1)) 2025-12-04T10:03:53.7257679Z 2025-12-04T10:03:53.7257753Z scheduler.step(val_loss) 2025-12-04T10:03:53.7257806Z ^ 2025-12-04T10:03:53.7257877Z warnings.warn(msg) 2025-12-04T10:03:53.7257933Z 2025-12-04T10:03:53.7258034Z  2025-12-04T10:03:53.7258159Z === Found 10 run-time warnings === 2025-12-04T10:03:53.7258285Z --- Runtime Warning: 1 / 10 --- 2025-12-04T10:03:53.7258435Z example = 2025-12-04T10:03:53.7259484Z :3: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage() 2025-12-04T10:03:53.7259544Z 2025-12-04T10:03:53.7259668Z --- Runtime Warning: 2 / 10 --- 2025-12-04T10:03:53.7259858Z example = 2025-12-04T10:03:53.7260772Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/_tensor.py:1392: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /var/lib/jenkins/workspace/c10/core/TensorImpl.h:1973.) 2025-12-04T10:03:53.7260855Z return super().refine_names(names) 2025-12-04T10:03:53.7260913Z 2025-12-04T10:03:53.7261037Z --- Runtime Warning: 3 / 10 --- 2025-12-04T10:03:53.7261250Z example = 2025-12-04T10:03:53.7261675Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/library.py:275: UserWarning: Warning only once for all operators, other operators may also be overridden. 
2025-12-04T10:03:53.7261887Z Overriding a previously registered kernel for the same operator and the same dispatch key 2025-12-04T10:03:53.7262029Z operator: aten::div.Tensor(Tensor self, Tensor other) -> Tensor 2025-12-04T10:03:53.7262232Z registered at /var/lib/jenkins/workspace/build/aten/src/ATen/RegisterSchema.cpp:6 2025-12-04T10:03:53.7262301Z dispatch key: CPU 2025-12-04T10:03:53.7262663Z previous kernel: registered at /var/lib/jenkins/workspace/aten/src/ATen/LegacyBatchingRegistrations.cpp:1079 2025-12-04T10:03:53.7263343Z new kernel: registered at :1 (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.) 2025-12-04T10:03:53.7263460Z impl_fn(self.ns, name.split("::")[-1], dispatch_key) 2025-12-04T10:03:53.7263514Z 2025-12-04T10:03:53.7263640Z --- Runtime Warning: 4 / 10 --- 2025-12-04T10:03:53.7263809Z example = 2025-12-04T10:03:53.7265082Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nested/__init__.py:117: UserWarning: The PyTorch API of nested tensors is in prototype stage and will change in the near future. We recommend specifying layout=torch.jagged when constructing a nested tensor, as this layout receives active development, has better operator coverage, and works with torch.compile. (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/NestedTensorImpl.cpp:178.) 2025-12-04T10:03:53.7265297Z return torch._nested_tensor_from_tensor_list(ts, dtype, None, device, None) 2025-12-04T10:03:53.7265351Z 2025-12-04T10:03:53.7265473Z --- Runtime Warning: 5 / 10 --- 2025-12-04T10:03:53.7265661Z example = 2025-12-04T10:03:53.7266747Z :1: UserWarning: Sparse CSR tensor support is in beta state. If you miss a functionality in the sparse tensor support, please submit a feature request to https://github.com/pytorch/pytorch/issues. (Triggered internally at /var/lib/jenkins/workspace/aten/src/ATen/SparseCsrTensorImpl.cpp:53.) 
2025-12-04T10:03:53.7266808Z 2025-12-04T10:03:53.7266932Z --- Runtime Warning: 6 / 10 --- 2025-12-04T10:03:53.7267235Z example = 2025-12-04T10:03:53.7268286Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/fx/experimental/const_fold.py:314: UserWarning: Attempted to insert a get_attr Node with no underlying reference in the owning GraphModule! Call GraphModule.add_submodule to add the necessary submodule, GraphModule.add_parameter to add the necessary Parameter, or nn.Module.register_buffer to add the necessary buffer 2025-12-04T10:03:53.7268401Z new_node = root_const_gm.graph.get_attr(in_node.target) 2025-12-04T10:03:53.7268462Z 2025-12-04T10:03:53.7268588Z --- Runtime Warning: 7 / 10 --- 2025-12-04T10:03:53.7268802Z example = 2025-12-04T10:03:53.7269535Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/transformer.py:144: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance) 2025-12-04T10:03:53.7269622Z self.encoder = TransformerEncoder( 2025-12-04T10:03:53.7269684Z 2025-12-04T10:03:53.7269804Z --- Runtime Warning: 8 / 10 --- 2025-12-04T10:03:53.7270026Z example = 2025-12-04T10:03:53.7270845Z :2: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance) 2025-12-04T10:03:53.7270902Z 2025-12-04T10:03:53.7271027Z --- Runtime Warning: 9 / 10 --- 2025-12-04T10:03:53.7271264Z example = 2025-12-04T10:03:53.7271821Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`. 
2025-12-04T10:03:53.7271907Z WeightNorm.apply(module, name, dim) 2025-12-04T10:03:53.7271960Z 2025-12-04T10:03:53.7272104Z --- Runtime Warning: 10 / 10 --- 2025-12-04T10:03:53.7272317Z example = 2025-12-04T10:03:53.7272864Z /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:144: FutureWarning: `torch.nn.utils.weight_norm` is deprecated in favor of `torch.nn.utils.parametrizations.weight_norm`. 2025-12-04T10:03:53.7272995Z WeightNorm.apply(module, name, dim) 2025-12-04T10:03:53.7273050Z 2025-12-04T10:03:53.7273269Z === 378 passed, 516 skipped, 27 warnings in 15.55 seconds === 2025-12-04T10:03:53.7273428Z Finished doctests 1/1 ... [2025-12-04 10:03:53.696518][1510.330987426], took 0.26min 2025-12-04T10:03:53.7273652Z Running inductor/test_cutlass_backend 1/1 ... [2025-12-04 10:03:53.698786][1510.333265645] 2025-12-04T10:03:53.7273735Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:53.7274459Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_cutlass_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:03:53.699116] 2025-12-04T10:03:59.0980304Z 2025-12-04T10:03:59.0981204Z inductor/test_cutlass_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_cutlass_backend_1.1_010b9ec4a3497121_.log 2025-12-04T10:03:59.0981755Z 2025-12-04T10:03:59.0982009Z Finished inductor/test_cutlass_backend 1/1 ... [2025-12-04 10:03:59.097903][1515.732378386], took 0.09min 2025-12-04T10:03:59.1003110Z Running inductor/test_benchmark_fusion 1/1 ... 
[2025-12-04 10:03:59.100125][1515.734604494] 2025-12-04T10:03:59.1003890Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:03:59.1006753Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_benchmark_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:03:59.100438] 2025-12-04T10:04:05.4067769Z 2025-12-04T10:04:05.4068874Z inductor/test_benchmark_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_benchmark_fusion_1.1_785876950c2bc41a_.log 2025-12-04T10:04:05.4099626Z Running 100 items in this shard: test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, 
test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, 
test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkFusionGpuTest::test_tield_kernel_fusion_cuda, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, 
test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code, test/inductor/test_benchmark_fusion.py::BenchmarkMultiTemplateFusionGpuTest::test_equivalent_template_code 2025-12-04T10:04:05.4129538Z 2025-12-04T10:04:05.4129836Z Finished inductor/test_benchmark_fusion 1/1 ... [2025-12-04 10:04:05.406673][1522.041148503], took 0.11min 2025-12-04T10:04:05.4130685Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-74d740f721c794b6.xml 2025-12-04T10:04:05.4815884Z Running inductor/test_distributed_patterns 1/1 ... 
[2025-12-04 10:04:05.481325][1522.115802954] 2025-12-04T10:04:05.4816418Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:05.4819322Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_distributed_patterns.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:05.481635] 2025-12-04T10:04:10.9821332Z 2025-12-04T10:04:10.9822292Z inductor/test_distributed_patterns 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_distributed_patterns_1.1_58512cfe279ad5e4_.log 2025-12-04T10:04:10.9823089Z Running 0 items in this shard: 2025-12-04T10:04:10.9823263Z 2025-12-04T10:04:10.9823584Z Finished inductor/test_distributed_patterns 1/1 ... [2025-12-04 10:04:10.981900][1527.616377333], took 0.09min 2025-12-04T10:04:10.9848274Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-f972528c27d5475e.xml 2025-12-04T10:04:11.0157314Z Running dynamo/test_fake_distributed 1/1 ... [2025-12-04 10:04:11.015508][1527.649987686] 2025-12-04T10:04:11.0157778Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:11.0160796Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_fake_distributed.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:04:11.015835] 2025-12-04T10:04:14.0893569Z 2025-12-04T10:04:14.0894490Z dynamo/test_fake_distributed 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_fake_distributed_1.1_a5e8149ee594be6f_.log 2025-12-04T10:04:14.0895276Z Running 0 items in this shard: 2025-12-04T10:04:14.0895446Z 2025-12-04T10:04:14.0895730Z Finished dynamo/test_fake_distributed 1/1 ... [2025-12-04 10:04:14.089137][1530.72361473], took 0.05min 2025-12-04T10:04:14.0919031Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-f18500af782cc14f.xml 2025-12-04T10:04:14.1185685Z Running test_sort_and_select 1/1 ... [2025-12-04 10:04:14.118371][1530.752850139] 2025-12-04T10:04:14.1186096Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:14.1189242Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_sort_and_select.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:14.118654] 2025-12-04T10:04:17.4844028Z 2025-12-04T10:04:17.4844867Z test_sort_and_select 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_sort_and_select_1.1_2c2b3dd622ee4cd1_.log 2025-12-04T10:04:17.4845585Z Running 0 items in this shard: 2025-12-04T10:04:17.4845765Z 2025-12-04T10:04:17.4846011Z Finished test_sort_and_select 1/1 ... [2025-12-04 10:04:17.484116][1534.118591426], took 0.06min 2025-12-04T10:04:17.4870767Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-1f54bb39d728015e.xml 2025-12-04T10:04:17.5165515Z Running test_cpp_api_parity 1/1 ... 
[2025-12-04 10:04:17.516312][1534.15078919] 2025-12-04T10:04:17.5165972Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:17.5169436Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_api_parity.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:17.516618] 2025-12-04T10:04:25.7802741Z 2025-12-04T10:04:25.7803563Z test_cpp_api_parity 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_api_parity_1.1_fb498b7352b18758_.log 2025-12-04T10:04:25.7804233Z Running 0 items in this shard: 2025-12-04T10:04:25.7804400Z 2025-12-04T10:04:25.7804652Z Finished test_cpp_api_parity 1/1 ... [2025-12-04 10:04:25.780062][1542.414538955], took 0.14min 2025-12-04T10:04:25.7831023Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-4ce4457e70f486b9.xml 2025-12-04T10:04:25.8672426Z Running test_extension_utils 1/1 ... [2025-12-04 10:04:25.866971][1542.501450082] 2025-12-04T10:04:25.8672983Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:25.8675590Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_extension_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:25.867260] 2025-12-04T10:04:28.6870371Z 2025-12-04T10:04:28.6871212Z test_extension_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_extension_utils_1.1_f7778f929dca92b8_.log 2025-12-04T10:04:28.6871887Z Running 0 items in this shard: 2025-12-04T10:04:28.6872068Z 2025-12-04T10:04:28.6872318Z Finished test_extension_utils 1/1 ... 
[2025-12-04 10:04:28.686806][1545.321283075], took 0.05min 2025-12-04T10:04:28.6900022Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_extension_utils/test_extension_utils-2e393af7d1353d9f.xml 2025-12-04T10:04:28.7172233Z Running test_show_pickle 1/1 ... [2025-12-04 10:04:28.717030][1545.351509515] 2025-12-04T10:04:28.7172597Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:28.7175577Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_show_pickle.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:28.717314] 2025-12-04T10:04:31.5298941Z 2025-12-04T10:04:31.5299794Z test_show_pickle 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_show_pickle_1.1_d14d40ffea3c45e9_.log 2025-12-04T10:04:31.5300456Z Running 0 items in this shard: 2025-12-04T10:04:31.5300629Z 2025-12-04T10:04:31.5300872Z Finished test_show_pickle 1/1 ... [2025-12-04 10:04:31.529679][1548.164155182], took 0.05min 2025-12-04T10:04:31.5330731Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_show_pickle/test_show_pickle-06dd150985a7f3b0.xml 2025-12-04T10:04:31.5623055Z Running test_torch 1/1 ... [2025-12-04 10:04:31.562071][1548.196549928] 2025-12-04T10:04:31.5623707Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:31.5626313Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_torch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:04:31.562359] 2025-12-04T10:04:50.3825381Z 2025-12-04T10:04:50.3826238Z PRINTING LOG FILE of test_torch 1/1 (test/test-reports/test_torch_1.1_c5508ce831427b28_.log) 2025-12-04T10:04:50.3827104Z Test results will be stored in test-reports/python-pytest/test_torch/test_torch-161156eb485440fd.xml 2025-12-04T10:04:50.3827750Z ============================= test session starts ============================== 2025-12-04T10:04:50.3828542Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:04:50.3829017Z cachedir: .pytest_cache 2025-12-04T10:04:50.3829575Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:04:50.3830295Z rootdir: /var/lib/jenkins/workspace 2025-12-04T10:04:50.3830567Z configfile: pytest.ini 2025-12-04T10:04:50.3831121Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:04:50.3831595Z collecting ... 
collected 976 items 2025-12-04T10:04:50.3831843Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:04:50.3854109Z Running 150 items in this shard: test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, 
test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_index_add_correctness, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, 
test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_qengine, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, 
test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, 
test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup, test/test_torch.py::TestTorch::test_tensoriterator_output_setup 2025-12-04T10:04:50.3876586Z 2025-12-04T10:04:50.3876772Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.5428s] [ 0%] 2025-12-04T10:04:50.3877431Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0007s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 1%] 2025-12-04T10:04:50.3878458Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3879146Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1276s] [ 2%] 2025-12-04T10:04:50.3879536Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2589s] [ 2%] 2025-12-04T10:04:50.3879988Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2667s] [ 2%] 2025-12-04T10:04:50.3880506Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2537s] [ 2%] 2025-12-04T10:04:50.3880901Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2813s] [ 2%] 2025-12-04T10:04:50.3881281Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2561s] [ 2%] 2025-12-04T10:04:50.3881672Z 
test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2921s] [ 2%] 2025-12-04T10:04:50.3882064Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2866s] [ 2%] 2025-12-04T10:04:50.3882458Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2672s] [ 2%] 2025-12-04T10:04:50.3882841Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2853s] [ 2%] 2025-12-04T10:04:50.3883231Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2785s] [ 2%] 2025-12-04T10:04:50.3883623Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2419s] [ 2%] 2025-12-04T10:04:50.3884007Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2437s] [ 2%] 2025-12-04T10:04:50.3884407Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2692s] [ 2%] 2025-12-04T10:04:50.3884802Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2511s] [ 2%] 2025-12-04T10:04:50.3885281Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2597s] [ 2%] 2025-12-04T10:04:50.3885680Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2535s] [ 2%] 2025-12-04T10:04:50.3886070Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2542s] [ 2%] 2025-12-04T10:04:50.3886466Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2549s] [ 2%] 2025-12-04T10:04:50.3886853Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2830s] [ 2%] 2025-12-04T10:04:50.3887247Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2524s] [ 2%] 2025-12-04T10:04:50.3887639Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.3099s] [ 2%] 2025-12-04T10:04:50.3888028Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2637s] [ 2%] 2025-12-04T10:04:50.3888472Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2770s] [ 2%] 2025-12-04T10:04:50.3888870Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1846s] [ 2%] 2025-12-04T10:04:50.3889319Z 
test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2861s] [ 2%] 2025-12-04T10:04:50.3889709Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2520s] [ 2%] 2025-12-04T10:04:50.3890105Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2595s] [ 2%] 2025-12-04T10:04:50.3890513Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2659s] [ 2%] 2025-12-04T10:04:50.3890913Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2559s] [ 2%] 2025-12-04T10:04:50.3891300Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2466s] [ 2%] 2025-12-04T10:04:50.3891694Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2416s] [ 2%] 2025-12-04T10:04:50.3892091Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2834s] [ 2%] 2025-12-04T10:04:50.3892483Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2568s] [ 2%] 2025-12-04T10:04:50.3892872Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2764s] [ 2%] 2025-12-04T10:04:50.3893334Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2574s] [ 2%] 2025-12-04T10:04:50.3893729Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2731s] [ 2%] 2025-12-04T10:04:50.3894115Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2841s] [ 2%] 2025-12-04T10:04:50.3894508Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2738s] [ 2%] 2025-12-04T10:04:50.3894904Z test_torch.py::TestTorch::test_index_add_correctness FAILED [0.1110s] [ 2%] 2025-12-04T10:04:50.3895292Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2374s] [ 2%] 2025-12-04T10:04:50.3895681Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2405s] [ 2%] 2025-12-04T10:04:50.3896073Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2400s] [ 2%] 2025-12-04T10:04:50.3896465Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2422s] [ 2%] 2025-12-04T10:04:50.3896849Z 
test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2434s] [ 2%] 2025-12-04T10:04:50.3897236Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2349s] [ 2%] 2025-12-04T10:04:50.3897630Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2478s] [ 2%] 2025-12-04T10:04:50.3898019Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2374s] [ 2%] 2025-12-04T10:04:50.3898407Z test_torch.py::TestTorch::test_index_add_correctness PASSED [0.2406s] [ 2%] 2025-12-04T10:04:50.3899039Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3899898Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3900802Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3901654Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3902507Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3903368Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3904255Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3905107Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification 
mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3905988Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3906848Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3907756Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3908606Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3909455Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3910352Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3911208Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3912063Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3912910Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3913760Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 
2025-12-04T10:04:50.3914614Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3915466Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3916340Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3917199Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3918043Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3918936Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3919795Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3920643Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3921496Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3922391Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3923245Z 
test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3924132Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3924993Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3925843Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3926690Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3927545Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3928434Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3929289Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3930126Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3930976Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3931831Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] 
(Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3932682Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3933536Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3934386Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3935234Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3936086Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3936971Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3937817Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3938664Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3939523Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3940402Z test_torch.py::TestTorch::test_qengine SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification 
mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3941752Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3942828Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3943807Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3944776Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3945757Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3946743Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3947875Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3948851Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3949828Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 
2025-12-04T10:04:50.3950813Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3951790Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3952761Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3953737Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3954710Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3955980Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3956966Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3957936Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3958936Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3959968Z 
test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3960953Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3962023Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3962987Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3963963Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3964939Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3965919Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3966947Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3967921Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3968899Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED 
[0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3969891Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3970870Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3971839Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3972810Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3973779Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3974787Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3975765Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3976728Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3977691Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode 
is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3978691Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3979665Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3980694Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3981656Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3982622Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3983593Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3984563Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3985579Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3986548Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 
2025-12-04T10:04:50.3987573Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0005s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3988555Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0006s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3989532Z test_torch.py::TestTorch::test_tensoriterator_output_setup SKIPPED [0.0004s] (Test is enabled but --rerun-disabled-tests verification mode is set, so only disabled tests are run) [ 2%] 2025-12-04T10:04:50.3990059Z 2025-12-04T10:04:50.3990163Z =================================== FAILURES =================================== 2025-12-04T10:04:50.3990480Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T10:04:50.3990788Z Traceback (most recent call last): 2025-12-04T10:04:50.3991171Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T10:04:50.3991557Z helper(dim, dtype, device, size, size) 2025-12-04T10:04:50.3991879Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T10:04:50.3992241Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T10:04:50.3992796Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:04:50.3993322Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:04:50.3993622Z AssertionError: Tensor-likes are not close! 
2025-12-04T10:04:50.3993790Z 2025-12-04T10:04:50.3993868Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T10:04:50.3994184Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T10:04:50.3994616Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T10:04:50.3994893Z 2025-12-04T10:04:50.3995023Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.3995371Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.3995582Z 2025-12-04T10:04:50.3996624Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.3997025Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T10:04:50.3997332Z Traceback (most recent call last): 2025-12-04T10:04:50.3997735Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T10:04:50.3998113Z helper(dim, dtype, device, size, size) 2025-12-04T10:04:50.3998442Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T10:04:50.3998805Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T10:04:50.3999313Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:04:50.3999828Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:04:50.4000123Z AssertionError: Tensor-likes are not close! 
2025-12-04T10:04:50.4000293Z 2025-12-04T10:04:50.4000379Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T10:04:50.4000700Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T10:04:50.4001148Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T10:04:50.4001471Z 2025-12-04T10:04:50.4001599Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.4001940Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.4002153Z 2025-12-04T10:04:50.4002311Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.4002692Z _____________________ TestTorch.test_index_add_correctness _____________________ 2025-12-04T10:04:50.4002990Z Traceback (most recent call last): 2025-12-04T10:04:50.4003345Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6717, in test_index_add_correctness 2025-12-04T10:04:50.4003728Z helper(dim, dtype, device, size, size) 2025-12-04T10:04:50.4004042Z File "/var/lib/jenkins/workspace/test/test_torch.py", line 6708, in helper 2025-12-04T10:04:50.4004394Z self.assertEqual(out, ref_out, atol=1e-2, rtol=1e-2) 2025-12-04T10:04:50.4004884Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 4284, in assertEqual 2025-12-04T10:04:50.4005401Z raise error_metas.pop()[0].to_error( # type: ignore[index] 2025-12-04T10:04:50.4005704Z AssertionError: Tensor-likes are not close! 
2025-12-04T10:04:50.4005866Z 2025-12-04T10:04:50.4005946Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T10:04:50.4006263Z Greatest absolute difference: 0.046875 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T10:04:50.4006712Z Greatest relative difference: 0.01324462890625 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T10:04:50.4006982Z 2025-12-04T10:04:50.4007116Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.4007456Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.4007672Z 2025-12-04T10:04:50.4007834Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.4008452Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_torch/test_torch-161156eb485440fd.xml - 2025-12-04T10:04:50.4008977Z =========================== short test summary info ============================ 2025-12-04T10:04:50.4009427Z FAILED [0.1276s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T10:04:50.4009778Z 2025-12-04T10:04:50.4009852Z Mismatched elements: 1 / 327680 (0.0%) 2025-12-04T10:04:50.4010166Z Greatest absolute difference: 0.0625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T10:04:50.4010618Z Greatest relative difference: 0.01470947265625 at index (4, 120, 82) (up to 0.01 allowed) 2025-12-04T10:04:50.4010896Z 2025-12-04T10:04:50.4011023Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.4011409Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.4011621Z 2025-12-04T10:04:50.4011788Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.4012292Z FAILED [0.1846s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 
2025-12-04T10:04:50.4012672Z 2025-12-04T10:04:50.4012746Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T10:04:50.4013060Z Greatest absolute difference: 0.03125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T10:04:50.4013505Z Greatest relative difference: 0.01495361328125 at index (1, 305, 250) (up to 0.01 allowed) 2025-12-04T10:04:50.4013773Z 2025-12-04T10:04:50.4013899Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.4014238Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.4014452Z 2025-12-04T10:04:50.4014609Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.4015122Z FAILED [0.1110s] test_torch.py::TestTorch::test_index_add_correctness - AssertionError: Tensor-likes are not close! 2025-12-04T10:04:50.4015463Z 2025-12-04T10:04:50.4015540Z Mismatched elements: 1 / 262144 (0.0%) 2025-12-04T10:04:50.4015904Z Greatest absolute difference: 0.046875 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T10:04:50.4016344Z Greatest relative difference: 0.01324462890625 at index (1, 197, 130) (up to 0.01 allowed) 2025-12-04T10:04:50.4016613Z 2025-12-04T10:04:50.4016747Z To execute this test, run the following from the base repo dir: 2025-12-04T10:04:50.4017079Z python test/test_torch.py TestTorch.test_index_add_correctness 2025-12-04T10:04:50.4017294Z 2025-12-04T10:04:50.4017452Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0 2025-12-04T10:04:50.4017822Z ================== 3 failed, 47 passed, 100 skipped in 14.45s ================== 2025-12-04T10:04:50.4018022Z 2025-12-04T10:04:50.4018258Z FINISHED PRINTING LOG FILE of test_torch 1/1 (test/test-reports/test_torch_1.1_c5508ce831427b28_.log) 2025-12-04T10:04:50.4018560Z 2025-12-04T10:04:50.4018716Z Finished test_torch 1/1 ... 
[2025-12-04 10:04:50.382372][1567.016849508], took 0.31min 2025-12-04T10:04:50.4019328Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_torch/test_torch-161156eb485440fd.xml 2025-12-04T10:04:52.1733526Z Uploading logs for 57120265563 to S3 2025-12-04T10:04:52.3464089Z Uploading artifacts took 1.82 seconds 2025-12-04T10:04:52.3464504Z test_torch 1/1 failed! 2025-12-04T10:04:52.3467965Z Running test_tensorexpr 1/1 ... [2025-12-04 10:04:52.346557][1568.981032273] 2025-12-04T10:04:52.3468403Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:52.3472485Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensorexpr.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:04:52.346985] 2025-12-04T10:04:55.2676904Z 2025-12-04T10:04:55.2677902Z test_tensorexpr 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensorexpr_1.1_382c0bca4aee7904_.log 2025-12-04T10:04:55.2678473Z Running 0 items in this shard: 2025-12-04T10:04:55.2678606Z 2025-12-04T10:04:55.2678813Z Finished test_tensorexpr 1/1 ... [2025-12-04 10:04:55.267455][1571.901929392], took 0.05min 2025-12-04T10:04:55.2712934Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-cc73ec26257e6848.xml 2025-12-04T10:04:55.3010235Z Running test_utils 1/1 ... [2025-12-04 10:04:55.300807][1571.935284011] 2025-12-04T10:04:55.3010630Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:04:55.3013807Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_utils.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:04:55.301124] 2025-12-04T10:05:07.8221698Z 2025-12-04T10:05:07.8222437Z test_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_utils_1.1_17124a5ce703c95e_.log 2025-12-04T10:05:07.8240302Z Running 100 items in this shard: test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, 
test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestCheckpoint::test_checkpoint_trigger, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, 
test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, 
test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed, test/test_utils.py::TestDataLoaderUtils::test_random_seed 2025-12-04T10:05:07.8256952Z 2025-12-04T10:05:07.8257126Z Finished test_utils 1/1 ... [2025-12-04 10:05:07.822012][1584.456486054], took 0.21min 2025-12-04T10:05:07.8260283Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_utils/test_utils-dc7ffe8b75564894.xml 2025-12-04T10:05:07.9065263Z Running test_namedtuple_return_api 1/1 ... [2025-12-04 10:05:07.906266][1584.540742692] 2025-12-04T10:05:07.9065735Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:05:07.9068938Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_namedtuple_return_api.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:07.906589] 2025-12-04T10:05:10.7328457Z 2025-12-04T10:05:10.7329589Z test_namedtuple_return_api 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_namedtuple_return_api_1.1_106189467b589eb1_.log 2025-12-04T10:05:10.7330366Z Running 0 items in this shard: 2025-12-04T10:05:10.7330539Z 2025-12-04T10:05:10.7330820Z Finished test_namedtuple_return_api 1/1 ... [2025-12-04 10:05:10.732622][1587.367097583], took 0.05min 2025-12-04T10:05:10.7366718Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-0528ea89b6c462b6.xml 2025-12-04T10:05:10.7623052Z Running test_fake_tensor 1/1 ... 
[2025-12-04 10:05:10.762080][1587.396557624] 2025-12-04T10:05:10.7623458Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:05:10.7626761Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_fake_tensor.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:10.762404] 2025-12-04T10:05:17.8534092Z 2025-12-04T10:05:17.8534996Z PRINTING LOG FILE of test_fake_tensor 1/1 (test/test-reports/test_fake_tensor_1.1_e3cb41e76a7ffef1_.log) 2025-12-04T10:05:17.8535786Z Test results will be stored in test-reports/python-pytest/test_fake_tensor/test_fake_tensor-541627ef745602ac.xml 2025-12-04T10:05:17.8536368Z ============================= test session starts ============================== 2025-12-04T10:05:17.8536880Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:05:17.8537337Z cachedir: .pytest_cache 2025-12-04T10:05:17.8537871Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:05:17.8538452Z rootdir: /var/lib/jenkins/workspace 2025-12-04T10:05:17.8538735Z configfile: pytest.ini 2025-12-04T10:05:17.8539284Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:05:17.8539788Z collecting ... 
collected 288 items 2025-12-04T10:05:17.8540134Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:05:17.8550689Z Running 50 items in this shard: test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, 
test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to, test/test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to 2025-12-04T10:05:17.8561306Z 2025-12-04T10:05:17.8561530Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to FAILED [0.0315s] [ 2%] 2025-12-04T10:05:17.8562023Z 
test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to FAILED [0.0081s] [ 2%] 2025-12-04T10:05:17.8562505Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0081s] [ 2%] 2025-12-04T10:05:17.8563066Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8563546Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8564024Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8564500Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8564975Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8565454Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8565934Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8566422Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8566899Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8567378Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8567849Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8568328Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8568805Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8569280Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8569758Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 
2025-12-04T10:05:17.8570252Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8570800Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8571282Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8571759Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8572238Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0080s] [ 2%] 2025-12-04T10:05:17.8572714Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8573367Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8573860Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8574438Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8574937Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8575413Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8575960Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8576444Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8576917Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8577401Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8577881Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8578363Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] 
[ 2%] 2025-12-04T10:05:17.8578839Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8579323Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8579840Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8580316Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8580786Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8581266Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8581741Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8582219Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8582695Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8583171Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8583651Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8584125Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0079s] [ 2%] 2025-12-04T10:05:17.8584600Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8585091Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0078s] [ 2%] 2025-12-04T10:05:17.8585570Z test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to PASSED [0.0077s] [ 2%] 2025-12-04T10:05:17.8585846Z 2025-12-04T10:05:17.8585940Z =================================== FAILURES =================================== 2025-12-04T10:05:17.8586274Z _________________ 
FakeTensorOperatorInvariants.test_module_to __________________
2025-12-04T10:05:17.8586595Z Traceback (most recent call last):
2025-12-04T10:05:17.8587011Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 986, in _apply
2025-12-04T10:05:17.8587579Z torch.utils.swap_tensors(param, param_applied)
2025-12-04T10:05:17.8588022Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/__init__.py", line 45, in swap_tensors
2025-12-04T10:05:17.8588529Z raise RuntimeError("Cannot swap t1 because it has weakref associated with it")
2025-12-04T10:05:17.8588929Z RuntimeError: Cannot swap t1 because it has weakref associated with it
2025-12-04T10:05:17.8589162Z
2025-12-04T10:05:17.8589315Z The above exception was the direct cause of the following exception:
2025-12-04T10:05:17.8589546Z
2025-12-04T10:05:17.8589628Z Traceback (most recent call last):
2025-12-04T10:05:17.8589977Z File "/var/lib/jenkins/workspace/test/test_fake_tensor.py", line 1665, in test_module_to
2025-12-04T10:05:17.8590321Z m.to("cuda")
2025-12-04T10:05:17.8590712Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1383, in to
2025-12-04T10:05:17.8591109Z return self._apply(convert)
2025-12-04T10:05:17.8591490Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 990, in _apply
2025-12-04T10:05:17.8591940Z raise RuntimeError(
2025-12-04T10:05:17.8592162Z RuntimeError: _apply(): Couldn't swap Linear.weight
2025-12-04T10:05:17.8592343Z
2025-12-04T10:05:17.8592476Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:17.8592853Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to
2025-12-04T10:05:17.8593110Z
2025-12-04T10:05:17.8593271Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:17.8593675Z _________________ FakeTensorOperatorInvariants.test_module_to __________________
2025-12-04T10:05:17.8593992Z Traceback (most recent call last):
2025-12-04T10:05:17.8594389Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 986, in _apply
2025-12-04T10:05:17.8594825Z torch.utils.swap_tensors(param, param_applied)
2025-12-04T10:05:17.8595260Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/__init__.py", line 47, in swap_tensors
2025-12-04T10:05:17.8595803Z raise RuntimeError("Cannot swap t2 because it has weakref associated with it")
2025-12-04T10:05:17.8596212Z RuntimeError: Cannot swap t2 because it has weakref associated with it
2025-12-04T10:05:17.8596447Z
2025-12-04T10:05:17.8596595Z The above exception was the direct cause of the following exception:
2025-12-04T10:05:17.8596817Z
2025-12-04T10:05:17.8596911Z Traceback (most recent call last):
2025-12-04T10:05:17.8597251Z File "/var/lib/jenkins/workspace/test/test_fake_tensor.py", line 1665, in test_module_to
2025-12-04T10:05:17.8597593Z m.to("cuda")
2025-12-04T10:05:17.8597939Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1383, in to
2025-12-04T10:05:17.8598322Z return self._apply(convert)
2025-12-04T10:05:17.8598703Z File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/nn/modules/module.py", line 990, in _apply
2025-12-04T10:05:17.8599095Z raise RuntimeError(
2025-12-04T10:05:17.8599305Z RuntimeError: _apply(): Couldn't swap Linear.bias
2025-12-04T10:05:17.8599486Z
2025-12-04T10:05:17.8599611Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:17.8599991Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to
2025-12-04T10:05:17.8600240Z
2025-12-04T10:05:17.8600404Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:17.8601001Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-541627ef745602ac.xml -
2025-12-04T10:05:17.8601542Z =========================== short test summary info ============================
2025-12-04T10:05:17.8602073Z FAILED [0.0315s] test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to - RuntimeError: _apply(): Couldn't swap Linear.weight
2025-12-04T10:05:17.8602478Z
2025-12-04T10:05:17.8602675Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:17.8603056Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to
2025-12-04T10:05:17.8603319Z
2025-12-04T10:05:17.8603476Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:17.8604029Z FAILED [0.0081s] test_fake_tensor.py::FakeTensorOperatorInvariants::test_module_to - RuntimeError: _apply(): Couldn't swap Linear.bias
2025-12-04T10:05:17.8604423Z
2025-12-04T10:05:17.8604550Z To execute this test, run the following from the base repo dir:
2025-12-04T10:05:17.8604931Z python test/test_fake_tensor.py FakeTensorOperatorInvariants.test_module_to
2025-12-04T10:05:17.8605185Z
2025-12-04T10:05:17.8605390Z This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
2025-12-04T10:05:17.8605733Z ========================= 2 failed, 48 passed in 0.88s =========================
2025-12-04T10:05:17.8605918Z
2025-12-04T10:05:17.8606187Z FINISHED PRINTING LOG FILE of test_fake_tensor 1/1 (test/test-reports/test_fake_tensor_1.1_e3cb41e76a7ffef1_.log)
2025-12-04T10:05:17.8606556Z
2025-12-04T10:05:17.8606734Z Finished test_fake_tensor 1/1 ... [2025-12-04 10:05:17.853218][1594.487695043], took 0.12min
2025-12-04T10:05:17.8607383Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-541627ef745602ac.xml
2025-12-04T10:05:19.0842600Z Uploading logs for 57120265563 to S3
2025-12-04T10:05:19.2525487Z Uploading artifacts took 1.32 seconds
2025-12-04T10:05:19.2525828Z test_fake_tensor 1/1 failed!
2025-12-04T10:05:19.2528783Z Running test_multiprocessing 1/1 ... [2025-12-04 10:05:19.252662][1595.887138343] 2025-12-04T10:05:19.2529216Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:05:19.2533064Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_multiprocessing.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:05:19.253056] 2025-12-04T10:10:23.4452703Z 2025-12-04T10:10:23.4453616Z test_multiprocessing 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_multiprocessing_1.1_c396cb0e4a333e9f_.log 2025-12-04T10:10:23.4477048Z Running 100 items in this shard: test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, 
test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, 
test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_cuda_variable_sharing, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, 
test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, 
test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple, test/test_multiprocessing.py::TestMultiprocessing::test_meta_simple 2025-12-04T10:10:23.4498261Z 2025-12-04T10:10:23.4498472Z Finished test_multiprocessing 1/1 ... [2025-12-04 10:10:23.445131][1900.079606273], took 5.07min 2025-12-04T10:10:23.4504137Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-59f445c48e82dcaa.xml 2025-12-04T10:10:23.5358428Z Running test_fx 1/1 ... [2025-12-04 10:10:23.535622][1900.170101148] 2025-12-04T10:10:23.5358978Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:10:23.5361840Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_fx.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:10:23.535934] 2025-12-04T10:12:57.5456429Z 2025-12-04T10:12:57.5459393Z test_fx 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_fx_1.1_56e3136b301d1666_.log 2025-12-04T10:12:57.5477385Z Running 100 items in this shard: test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, 
test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestFX::test_trace_buffer_slice, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, 
test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, 
test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7, test/test_fx.py::TestVisionTracing::test_torchvision_models_efficientnet_b7 2025-12-04T10:12:57.5494614Z 2025-12-04T10:12:57.5494926Z Finished test_fx 1/1 ... [2025-12-04 10:12:57.545446][2054.179919603], took 2.57min 2025-12-04T10:12:57.5509520Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_fx/test_fx-8e8ec79e212b88b9.xml 2025-12-04T10:12:57.6466555Z Running test_autograd_fallback 1/1 ... [2025-12-04 10:12:57.646409][2054.280885563] 2025-12-04T10:12:57.6467100Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:12:57.6470231Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_autograd_fallback.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:12:57.646735] 2025-12-04T10:13:00.5014197Z 2025-12-04T10:13:00.5015113Z test_autograd_fallback 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_autograd_fallback_1.1_54a485e9b165ff35_.log 2025-12-04T10:13:00.5015825Z Running 0 items in this shard: 2025-12-04T10:13:00.5016000Z 2025-12-04T10:13:00.5016272Z Finished test_autograd_fallback 1/1 ... 
[2025-12-04 10:13:00.501212][2057.135685262], took 0.05min 2025-12-04T10:13:00.5058364Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-8bc86f9f976d5210.xml 2025-12-04T10:13:00.5324183Z Running test_autocast 1/1 ... [2025-12-04 10:13:00.532193][2057.166671268] 2025-12-04T10:13:00.5324575Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:00.5327884Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_autocast.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:00.532521] 2025-12-04T10:13:03.6964171Z 2025-12-04T10:13:03.6964956Z test_autocast 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_autocast_1.1_8067b1042af94705_.log 2025-12-04T10:13:03.6965614Z Running 0 items in this shard: 2025-12-04T10:13:03.6965797Z 2025-12-04T10:13:03.6966018Z Finished test_autocast 1/1 ... [2025-12-04 10:13:03.696199][2060.330673307], took 0.05min 2025-12-04T10:13:03.7019995Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_autocast/test_autocast-260662fd6260d97e.xml 2025-12-04T10:13:03.7276045Z Running test_python_dispatch 1/1 ... [2025-12-04 10:13:03.727374][2060.361851167] 2025-12-04T10:13:03.7276570Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:03.7279732Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_python_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:03.727701] 2025-12-04T10:13:08.1021877Z 2025-12-04T10:13:08.1023001Z test_python_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_python_dispatch_1.1_65235bc5900c6671_.log 2025-12-04T10:13:08.1023760Z Running 0 items in this shard: 2025-12-04T10:13:08.1023941Z 2025-12-04T10:13:08.1024199Z Finished test_python_dispatch 1/1 ... [2025-12-04 10:13:08.101945][2064.736417626], took 0.07min 2025-12-04T10:13:08.1071264Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-ac7034bd8d91ec1a.xml 2025-12-04T10:13:08.1322128Z Running test_jit_disabled 1/1 ... [2025-12-04 10:13:08.131973][2064.766449832] 2025-12-04T10:13:08.1322548Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:08.1325935Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_jit_disabled.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:08.132304] 2025-12-04T10:13:10.9563937Z 2025-12-04T10:13:10.9564989Z test_jit_disabled 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_jit_disabled_1.1_9acfb14806c2339e_.log 2025-12-04T10:13:10.9565677Z Running 0 items in this shard: 2025-12-04T10:13:10.9565847Z 2025-12-04T10:13:10.9566085Z Finished test_jit_disabled 1/1 ... [2025-12-04 10:13:10.956169][2067.590644287], took 0.05min 2025-12-04T10:13:10.9612970Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-38a8accee470b174.xml 2025-12-04T10:13:10.9896177Z Running test_cpp_extensions_mtia_backend 1/1 ... 
[2025-12-04 10:13:10.989411][2067.623888442] 2025-12-04T10:13:10.9896656Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:10.9899894Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_mtia_backend.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:10.989731] 2025-12-04T10:13:13.8210989Z 2025-12-04T10:13:13.8211917Z test_cpp_extensions_mtia_backend 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_mtia_backend_1.1_47e5defbe240dd5e_.log 2025-12-04T10:13:13.8212681Z Running 0 items in this shard: 2025-12-04T10:13:13.8212861Z 2025-12-04T10:13:13.8213153Z Finished test_cpp_extensions_mtia_backend 1/1 ... [2025-12-04 10:13:13.820892][2070.455364542], took 0.05min 2025-12-04T10:13:13.8271297Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-c1c0a2e49ca1a379.xml 2025-12-04T10:13:13.8521553Z Running functorch/test_memory_efficient_fusion 1/1 ... [2025-12-04 10:13:13.851924][2070.48640213] 2025-12-04T10:13:13.8522053Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:13.8525052Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'functorch/test_memory_efficient_fusion.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:13.852250] 2025-12-04T10:13:16.9322575Z 2025-12-04T10:13:16.9323955Z functorch/test_memory_efficient_fusion 1/1 was successful, full logs can be found in artifacts with path test/test-reports/functorch.test_memory_efficient_fusion_1.1_3d166d53ca5578b9_.log 2025-12-04T10:13:16.9325126Z Running 0 items in this shard: 2025-12-04T10:13:16.9325381Z 2025-12-04T10:13:16.9325841Z Finished functorch/test_memory_efficient_fusion 1/1 ... [2025-12-04 10:13:16.932047][2073.566521455], took 0.05min 2025-12-04T10:13:16.9377982Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-dd393fbc07d99e9e.xml 2025-12-04T10:13:16.9616640Z Running test_tensor_creation_ops 1/1 ... [2025-12-04 10:13:16.961403][2073.595880766] 2025-12-04T10:13:16.9617281Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:16.9620380Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_tensor_creation_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:16.961736] 2025-12-04T10:13:20.9713344Z 2025-12-04T10:13:20.9714198Z test_tensor_creation_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_tensor_creation_ops_1.1_7a72945e9c8beebc_.log 2025-12-04T10:13:20.9714945Z Running 0 items in this shard: 2025-12-04T10:13:20.9715118Z 2025-12-04T10:13:20.9715652Z Finished test_tensor_creation_ops 1/1 ... [2025-12-04 10:13:20.971067][2077.60554268], took 0.07min 2025-12-04T10:13:20.9767406Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-09e3e0f157e06752.xml 2025-12-04T10:13:21.0051605Z Running test_cpp_extensions_stream_and_event 1/1 ... 
[2025-12-04 10:13:21.004921][2077.639398297] 2025-12-04T10:13:21.0052089Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:21.0055371Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_stream_and_event.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:21.005260] 2025-12-04T10:13:23.8066908Z 2025-12-04T10:13:23.8068402Z test_cpp_extensions_stream_and_event 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_stream_and_event_1.1_2d77e458babddace_.log 2025-12-04T10:13:23.8069707Z Running 0 items in this shard: 2025-12-04T10:13:23.8069980Z 2025-12-04T10:13:23.8070467Z Finished test_cpp_extensions_stream_and_event 1/1 ... [2025-12-04 10:13:23.806438][2080.440913902], took 0.05min 2025-12-04T10:13:23.8122167Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-cb8aaf0c2b78a127.xml 2025-12-04T10:13:23.8395519Z Running test_dispatch 1/1 ... [2025-12-04 10:13:23.839281][2080.473758418] 2025-12-04T10:13:23.8395940Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:23.8398720Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_dispatch.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:23.839583] 2025-12-04T10:13:26.7019429Z 2025-12-04T10:13:26.7027874Z test_dispatch 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_dispatch_1.1_fd773829162dcd6a_.log 2025-12-04T10:13:26.7028528Z Running 0 items in this shard: 2025-12-04T10:13:26.7028673Z 2025-12-04T10:13:26.7028862Z Finished test_dispatch 1/1 ... [2025-12-04 10:13:26.701733][2083.336207673], took 0.05min 2025-12-04T10:13:26.7077783Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_dispatch/test_dispatch-fae8bf7b5906c582.xml 2025-12-04T10:13:26.7309774Z Running nn/test_convolution 1/1 ... [2025-12-04 10:13:26.730634][2083.365106764] 2025-12-04T10:13:26.7310197Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:26.7313212Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_convolution.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:26.731072] 2025-12-04T10:13:31.2623748Z 2025-12-04T10:13:31.2624792Z nn/test_convolution 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_convolution_1.1_2ecf4aa97a43dda6_.log 2025-12-04T10:13:31.2641407Z Running 50 items in this shard: test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, 
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, 
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda, test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCUDA::test_conv3d_64bit_indexing_cuda 2025-12-04T10:13:31.2655645Z 2025-12-04T10:13:31.2655860Z Finished nn/test_convolution 1/1 ... [2025-12-04 10:13:31.262241][2087.896714554], took 0.08min 2025-12-04T10:13:31.2684562Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-4066c5253990dd79.xml 2025-12-04T10:13:31.3062693Z Running test_cpp_extensions_jit 1/1 ... [2025-12-04 10:13:31.306020][2087.940496736] 2025-12-04T10:13:31.3063122Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:31.3065999Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_jit.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:31.306322] 2025-12-04T10:13:34.6588268Z 2025-12-04T10:13:34.6589274Z test_cpp_extensions_jit 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_jit_1.1_1167a044bc330c16_.log 2025-12-04T10:13:34.6589968Z Running 0 items in this shard: 2025-12-04T10:13:34.6590151Z 2025-12-04T10:13:34.6590419Z Finished test_cpp_extensions_jit 1/1 ... [2025-12-04 10:13:34.658278][2091.292753024], took 0.06min 2025-12-04T10:13:34.6645964Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-1d0408224d2abc94.xml 2025-12-04T10:13:34.6918660Z Running test_nn 1/1 ... [2025-12-04 10:13:34.691641][2091.326118931] 2025-12-04T10:13:34.6919042Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:34.6922141Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_nn.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:34.691947] 2025-12-04T10:13:41.9977409Z 2025-12-04T10:13:41.9978356Z test_nn 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_nn_1.1_0bfb94cdb04087aa_.log 2025-12-04T10:13:42.0006530Z Running 150 items in this shard: test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, 
test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_Linear_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, 
test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, 
test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_TransformerEncoderLayer_gelu_activation_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, 
test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, 
test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32, test/test_nn.py::TestNN::test_Transformer_multilayer_coder_cuda_tf32 2025-12-04T10:13:42.0032006Z 2025-12-04T10:13:42.0032231Z Finished test_nn 1/1 ... [2025-12-04 10:13:41.997708][2098.632180813], took 0.12min 2025-12-04T10:13:42.0041513Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_nn/test_nn-7a49688264af9155.xml 2025-12-04T10:13:42.0948330Z Running test_multiprocessing_spawn 1/1 ... [2025-12-04 10:13:42.094562][2098.729039214] 2025-12-04T10:13:42.0948794Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:42.0951743Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_multiprocessing_spawn.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:42.094879] 2025-12-04T10:13:49.9562520Z 2025-12-04T10:13:49.9563392Z test_multiprocessing_spawn 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_multiprocessing_spawn_1.1_9fb1f4fec5b6e0d2_.log 2025-12-04T10:13:49.9564147Z Running 0 items in this shard: 2025-12-04T10:13:49.9564348Z 2025-12-04T10:13:49.9564638Z Finished test_multiprocessing_spawn 1/1 ... [2025-12-04 10:13:49.956028][2106.590502247], took 0.13min 2025-12-04T10:13:49.9625563Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-5b6e250b7bbb2ba6.xml 2025-12-04T10:13:50.0407474Z Running nn/test_pooling 1/1 ... 
[2025-12-04 10:13:50.040506][2106.674984469] 2025-12-04T10:13:50.0407885Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:50.0410774Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'nn/test_pooling.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:50.040809] 2025-12-04T10:13:53.5270415Z 2025-12-04T10:13:53.5271204Z nn/test_pooling 1/1 was successful, full logs can be found in artifacts with path test/test-reports/nn.test_pooling_1.1_02768dc568b09226_.log 2025-12-04T10:13:53.5271938Z Running 0 items in this shard: 2025-12-04T10:13:53.5272141Z 2025-12-04T10:13:53.5272420Z Finished nn/test_pooling 1/1 ... [2025-12-04 10:13:53.526805][2110.161279386], took 0.06min 2025-12-04T10:13:53.5336079Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-6222189819ddcf1e.xml 2025-12-04T10:13:53.5629208Z Running test_cuda_trace 1/1 ... [2025-12-04 10:13:53.562664][2110.197142614] 2025-12-04T10:13:53.5629612Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:53.5632888Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_trace.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:53.563033] 2025-12-04T10:13:56.4012899Z 2025-12-04T10:13:56.4013698Z test_cuda_trace 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_trace_1.1_d047b12230bdbed1_.log 2025-12-04T10:13:56.4014585Z Running 0 items in this shard: 2025-12-04T10:13:56.4014805Z 2025-12-04T10:13:56.4015037Z Finished test_cuda_trace 1/1 ... 
[2025-12-04 10:13:56.401060][2113.035534081], took 0.05min 2025-12-04T10:13:56.4083330Z Running test_native_mha 1/1 ... [2025-12-04 10:13:56.408092][2113.042569831] 2025-12-04T10:13:56.4083809Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:56.4087365Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_native_mha.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:13:56.408476] 2025-12-04T10:13:59.7349406Z 2025-12-04T10:13:59.7350452Z test_native_mha 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_native_mha_1.1_1ef09fc6539df3bd_.log 2025-12-04T10:13:59.7351137Z Running 0 items in this shard: 2025-12-04T10:13:59.7351320Z 2025-12-04T10:13:59.7351555Z Finished test_native_mha 1/1 ... [2025-12-04 10:13:59.734672][2116.369146941], took 0.06min 2025-12-04T10:13:59.7417792Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_native_mha/test_native_mha-bac556999acb8bd6.xml 2025-12-04T10:13:59.8130964Z Running test_cuda_nvml_based_avail 1/1 ... [2025-12-04 10:13:59.812774][2116.447245676] 2025-12-04T10:13:59.8131421Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:13:59.8134354Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_nvml_based_avail.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:13:59.813145] 2025-12-04T10:14:02.6278661Z 2025-12-04T10:14:02.6279541Z test_cuda_nvml_based_avail 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_nvml_based_avail_1.1_cc197c973db74fb9_.log 2025-12-04T10:14:02.6280306Z Running 0 items in this shard: 2025-12-04T10:14:02.6280752Z 2025-12-04T10:14:02.6281018Z Finished test_cuda_nvml_based_avail 1/1 ... [2025-12-04 10:14:02.627623][2119.262098557], took 0.05min 2025-12-04T10:14:02.6347856Z Running test_mobile_optimizer 1/1 ... [2025-12-04 10:14:02.634539][2119.269016345] 2025-12-04T10:14:02.6348278Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:14:02.6351105Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_mobile_optimizer.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:14:02.634866] 2025-12-04T10:14:06.1056280Z 2025-12-04T10:14:06.1057042Z test_mobile_optimizer 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_mobile_optimizer_1.1_4839ede4d61f3b89_.log 2025-12-04T10:14:06.1079419Z Running 100 items in this shard: test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, 
test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, 
test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_hoist_conv_packed_params, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, 
test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, 
test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures, test/test_mobile_optimizer.py::TestOptimizer::test_quantized_conv_no_asan_failures 2025-12-04T10:14:06.1101086Z 2025-12-04T10:14:06.1101297Z Finished test_mobile_optimizer 1/1 ... [2025-12-04 10:14:06.105496][2122.739970178], took 0.06min 2025-12-04T10:14:06.1125960Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-7cf39d7714d1461e.xml 2025-12-04T10:14:06.1892782Z Running test_cuda_primary_ctx 1/1 ... 
[2025-12-04 10:14:06.189024][2122.823500854] 2025-12-04T10:14:06.1893215Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:14:06.1896176Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cuda_primary_ctx.py', '--shard-id=1', '--num-shards=1', '-v', '--subprocess', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:14:06.189335] 2025-12-04T10:14:09.3167831Z 2025-12-04T10:14:09.3168725Z test_cuda_primary_ctx 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cuda_primary_ctx_1.1_f78fb2d6a682ee44_.log 2025-12-04T10:14:09.3169418Z Running 0 items in this shard: 2025-12-04T10:14:09.3169842Z 2025-12-04T10:14:09.3170135Z Finished test_cuda_primary_ctx 1/1 ... [2025-12-04 10:14:09.316557][2125.951032536], took 0.05min 2025-12-04T10:14:09.3237909Z Running test_reductions 1/1 ... [2025-12-04 10:14:09.323579][2125.958057797] 2025-12-04T10:14:09.3238312Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:14:09.3241330Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_reductions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:14:09.323899] 2025-12-04T10:14:19.9591961Z 2025-12-04T10:14:19.9593063Z test_reductions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_reductions_1.1_474a3edd9482342d_.log 2025-12-04T10:14:19.9593737Z Running 0 items in this shard: 2025-12-04T10:14:19.9593905Z 2025-12-04T10:14:19.9594164Z Finished test_reductions 1/1 ... 
[2025-12-04 10:14:19.958905][2136.593379156], took 0.18min 2025-12-04T10:14:19.9661708Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_reductions/test_reductions-8fa1f3b895437bdd.xml 2025-12-04T10:14:20.0320469Z Running test_spectral_ops 1/1 ... [2025-12-04 10:14:20.031808][2136.666285552] 2025-12-04T10:14:20.0320891Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:14:20.0323658Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_spectral_ops.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:14:20.032105] 2025-12-04T10:14:24.7449843Z 2025-12-04T10:14:24.7450695Z test_spectral_ops 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_spectral_ops_1.1_68b862ae55c7c6af_.log 2025-12-04T10:14:24.7451405Z Running 0 items in this shard: 2025-12-04T10:14:24.7451604Z 2025-12-04T10:14:24.7451896Z Finished test_spectral_ops 1/1 ... [2025-12-04 10:14:24.744723][2141.379197806], took 0.08min 2025-12-04T10:14:24.7521971Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-b3af31a20fb8ad2a.xml 2025-12-04T10:14:24.7803219Z Running distributions/test_distributions 1/1 ... [2025-12-04 10:14:24.780076][2141.414553694] 2025-12-04T10:14:24.7803707Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:14:24.7806457Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'distributions/test_distributions.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:14:24.780368] 2025-12-04T10:14:28.6653064Z 2025-12-04T10:14:28.6653998Z distributions/test_distributions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/distributions.test_distributions_1.1_ced7167d7dfd0dab_.log 2025-12-04T10:14:28.6669710Z Running 50 items in this shard: test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, 
test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, 
test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample, test/distributions/test_distributions.py::TestDistributions::test_binomial_sample 2025-12-04T10:14:28.6682206Z 2025-12-04T10:14:28.6682458Z Finished distributions/test_distributions 1/1 ... [2025-12-04 10:14:28.665097][2145.299571114], took 0.06min 2025-12-04T10:14:28.6728026Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-2a85008eb39e7213.xml 2025-12-04T10:14:28.7100341Z Running test_autoload_disable 1/1 ... [2025-12-04 10:14:28.709785][2145.344260121] 2025-12-04T10:14:29.0004293Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T10:14:31.8112392Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T10:14:31.8133348Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T10:15:45.8430394Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... 
[?25l- \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-12-04T10:15:45.8540075Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199635 sha256=a3d4ea2812569d22462fcbb7467a2dc9d620362570ca1c54154e67899720ffcf 2025-12-04T10:15:45.8542121Z Stored in directory: /tmp/pip-ephem-wheel-cache-ws1tnv9z/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T10:15:45.8558265Z Successfully built torch_test_cpp_extension 2025-12-04T10:15:46.1587602Z Installing collected packages: torch_test_cpp_extension 2025-12-04T10:15:46.3354310Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T10:15:48.6915440Z 2025-12-04T10:15:48.6915836Z Running tests... 2025-12-04T10:15:48.6916488Z ---------------------------------------------------------------------- 2025-12-04T10:15:48.9720259Z . 2025-12-04T10:15:48.9720614Z ---------------------------------------------------------------------- 2025-12-04T10:15:48.9721303Z Ran 1 test in 0.280s 2025-12-04T10:15:48.9721447Z 2025-12-04T10:15:48.9721517Z OK 2025-12-04T10:15:48.9721621Z 2025-12-04T10:15:48.9721716Z Generating XML reports... 2025-12-04T10:15:49.5486615Z Finished test_autoload_disable 1/1 ... [2025-12-04 10:15:49.548240][2226.182706066], took 1.35min 2025-12-04T10:15:49.5563329Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101548.xml 2025-12-04T10:15:49.6344784Z Running test_autoload_enable 1/1 ... [2025-12-04 10:15:49.634217][2226.268692911] 2025-12-04T10:15:49.9409150Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T10:15:52.7345872Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T10:15:52.7367575Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T10:17:06.4264699Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... 
[?25l- \ | / - \ | / - \ | / - \ | / - \ | / - done 2025-12-04T10:17:06.4375440Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199635 sha256=961315662f91b21c5fa36e452d0a9a635e66c067d4065d2ef5eff44d2c1b8495 2025-12-04T10:17:06.4378310Z Stored in directory: /tmp/pip-ephem-wheel-cache-s7_5w6_8/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T10:17:06.4394947Z Successfully built torch_test_cpp_extension 2025-12-04T10:17:06.7446890Z Installing collected packages: torch_test_cpp_extension 2025-12-04T10:17:06.9264583Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T10:17:09.2666514Z 2025-12-04T10:17:09.2667039Z Running tests... 2025-12-04T10:17:09.2667503Z ---------------------------------------------------------------------- 2025-12-04T10:17:09.5462550Z . 2025-12-04T10:17:09.5462891Z ---------------------------------------------------------------------- 2025-12-04T10:17:09.5463306Z Ran 1 test in 0.280s 2025-12-04T10:17:09.5463457Z 2025-12-04T10:17:09.5463526Z OK 2025-12-04T10:17:09.5463634Z 2025-12-04T10:17:09.5463727Z Generating XML reports... 2025-12-04T10:17:10.1129060Z Finished test_autoload_enable 1/1 ... [2025-12-04 10:17:10.112496][2306.746962134], took 1.34min 2025-12-04T10:17:10.1207079Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101709.xml 2025-12-04T10:17:10.2002821Z Running test_cpp_extensions_aot_ninja 1/1 ... [2025-12-04 10:17:10.199992][2306.834469943] 2025-12-04T10:17:10.5322420Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T10:17:13.3953009Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T10:17:13.3973910Z [?25hBuilding wheels for collected packages: torch_test_cpp_extension 2025-12-04T10:18:28.5770794Z Building wheel for torch_test_cpp_extension (pyproject.toml) ... 
[?25l- \ | / - \ | / - \ | / - \ | / - \ | / - \ | / - \ done 2025-12-04T10:18:28.5901915Z [?25h Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=16081734 sha256=5d7d29cb8c12b067513ac922e9db82b57713d43aea888d60c52469c54c48f960 2025-12-04T10:18:28.5904766Z Stored in directory: /tmp/pip-ephem-wheel-cache-ut6kmmj6/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T10:18:28.5921889Z Successfully built torch_test_cpp_extension 2025-12-04T10:18:28.8937067Z Installing collected packages: torch_test_cpp_extension 2025-12-04T10:18:29.1126136Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T10:18:29.4498512Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/no_python_abi_suffix_test 2025-12-04T10:18:31.1294126Z Preparing metadata (pyproject.toml) ... [?25l- done 2025-12-04T10:18:31.1316188Z [?25hBuilding wheels for collected packages: no_python_abi_suffix_test 2025-12-04T10:18:33.0740243Z Building wheel for no_python_abi_suffix_test (pyproject.toml) ... [?25l- \ | done 2025-12-04T10:18:33.0748687Z [?25h Created wheel for no_python_abi_suffix_test: filename=no_python_abi_suffix_test-0.0.0-cp310-cp310-linux_x86_64.whl size=2944 sha256=e49ec864a975c5419e7bf858a3d7732c60cb3cf790688a3d450794f6d1c91aa3 2025-12-04T10:18:33.0749927Z Stored in directory: /tmp/pip-ephem-wheel-cache-9klrrjvt/wheels/8c/c7/11/bcf2bfbdebb3cf78b8211ac54acc945a8fdf1732548d147a80 2025-12-04T10:18:33.0768311Z Successfully built no_python_abi_suffix_test 2025-12-04T10:18:33.3813885Z Installing collected packages: no_python_abi_suffix_test 2025-12-04T10:18:33.3929447Z Successfully installed no_python_abi_suffix_test-0.0.0 2025-12-04T10:18:33.4800460Z * Getting build dependencies for wheel... 
2025-12-04T10:18:34.8441799Z running egg_info 2025-12-04T10:18:34.8517341Z creating python_agnostic.egg-info 2025-12-04T10:18:34.8519329Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T10:18:34.8522633Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T10:18:34.8525511Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T10:18:34.8526931Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:18:34.8961395Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:18:34.8968704Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:18:35.1862138Z * Building wheel... 2025-12-04T10:18:36.5412313Z running bdist_wheel 2025-12-04T10:18:36.5994236Z running build 2025-12-04T10:18:36.5994565Z running build_ext 2025-12-04T10:18:36.6027904Z building 'python_agnostic._C' extension 2025-12-04T10:18:36.6030959Z creating /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc 2025-12-04T10:18:46.8480797Z [1/1] /usr/local/cuda/bin/nvcc --generate-dependencies-with-compile --dependency-output /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o.d -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include -I/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/usr/local/cuda/include -I/opt/conda/envs/py_3.10/include/python3.10 -c -c /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.cu -o 
/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DTORCH_API_INCLUDE_EXTENSION_H -DPy_LIMITED_API=0x030A0000 -DTORCH_EXTENSION_NAME=_C -gencode=arch=compute_89,code=sm_89 -std=c++17 2025-12-04T10:18:46.8546027Z creating build/lib.linux-x86_64-cpython-310/python_agnostic 2025-12-04T10:18:46.8551107Z g++ -pthread -B /opt/conda/envs/py_3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -pthread -B /opt/conda/envs/py_3.10/compiler_compat -shared /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so 2025-12-04T10:18:47.5192837Z installing to build/bdist.linux-x86_64/wheel 2025-12-04T10:18:47.5193476Z running install 2025-12-04T10:18:47.5233479Z running install_lib 2025-12-04T10:18:47.5308457Z creating build/bdist.linux-x86_64/wheel 2025-12-04T10:18:47.5310900Z creating build/bdist.linux-x86_64/wheel/python_agnostic 2025-12-04T10:18:47.5312379Z copying build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so -> build/bdist.linux-x86_64/wheel/./python_agnostic 2025-12-04T10:18:47.5318642Z running install_egg_info 2025-12-04T10:18:47.5394951Z running egg_info 2025-12-04T10:18:47.5464871Z writing python_agnostic.egg-info/PKG-INFO 
2025-12-04T10:18:47.5468803Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T10:18:47.5481513Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T10:18:47.5572447Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:18:47.5581795Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:18:47.5598919Z Copying python_agnostic.egg-info to build/bdist.linux-x86_64/wheel/./python_agnostic-0.0-py3.10.egg-info 2025-12-04T10:18:47.5606315Z running install_scripts 2025-12-04T10:18:47.5717265Z creating build/bdist.linux-x86_64/wheel/python_agnostic-0.0.dist-info/WHEEL 2025-12-04T10:18:47.5722103Z creating '/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/dist/.tmp-fh7z2deq/python_agnostic-0.0-cp39-abi3-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it 2025-12-04T10:18:47.5878195Z adding 'python_agnostic/_C.so' 2025-12-04T10:18:47.5884048Z adding 'python_agnostic-0.0.dist-info/METADATA' 2025-12-04T10:18:47.5885289Z adding 'python_agnostic-0.0.dist-info/WHEEL' 2025-12-04T10:18:47.5886881Z adding 'python_agnostic-0.0.dist-info/top_level.txt' 2025-12-04T10:18:47.5888211Z adding 'python_agnostic-0.0.dist-info/RECORD' 2025-12-04T10:18:47.5889124Z removing build/bdist.linux-x86_64/wheel 2025-12-04T10:18:47.8449115Z Successfully built python_agnostic-0.0-cp39-abi3-linux_x86_64.whl 2025-12-04T10:18:48.1501688Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_9_extension 2025-12-04T10:18:49.8438902Z Preparing metadata (pyproject.toml) ... 
done 2025-12-04T10:18:49.8462279Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_9==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T10:18:49.8485814Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.18.0) 2025-12-04T10:18:49.8490645Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (4.12.2) 2025-12-04T10:18:49.8495192Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (1.13.3) 2025-12-04T10:18:49.8499937Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2.8.8) 2025-12-04T10:18:49.8503172Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.1.6) 2025-12-04T10:18:49.8507511Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2025.10.0) 2025-12-04T10:18:49.8834861Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_9==0.0) (1.3.0) 2025-12-04T10:18:49.8883071Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_9==0.0) (3.0.3) 2025-12-04T10:18:49.8892501Z Building wheels for collected packages: libtorch_agnostic_2_9 2025-12-04T10:18:54.9988208Z Building wheel for libtorch_agnostic_2_9 (pyproject.toml) ...
done 2025-12-04T10:18:54.9998457Z Created wheel for libtorch_agnostic_2_9: filename=libtorch_agnostic_2_9-0.0-cp39-abi3-linux_x86_64.whl size=55939 sha256=bfc18d379408bc2625a83df11d4824427914f4f053e2cc45e948a0d7cc613288 2025-12-04T10:18:54.9999710Z Stored in directory: /tmp/pip-ephem-wheel-cache-wwcqysz2/wheels/e1/56/0d/91ac1e918c8015b48f6a77f66abeeb8427a8788f7d37715e0e 2025-12-04T10:18:55.0017679Z Successfully built libtorch_agnostic_2_9 2025-12-04T10:18:55.2733190Z Installing collected packages: libtorch_agnostic_2_9 2025-12-04T10:18:55.2905860Z Successfully installed libtorch_agnostic_2_9-0.0 2025-12-04T10:18:55.6226872Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_10_extension 2025-12-04T10:18:57.3166319Z Preparing metadata (pyproject.toml) ... done 2025-12-04T10:18:57.3190930Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_10==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T10:18:57.3217121Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.18.0) 2025-12-04T10:18:57.3221623Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (4.12.2) 2025-12-04T10:18:57.3225592Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (1.13.3) 2025-12-04T10:18:57.3230276Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2.8.8) 2025-12-04T10:18:57.3233408Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.1.6) 2025-12-04T10:18:57.3237389Z Requirement already satisfied: fsspec>=0.8.5 in
/opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2025.10.0) 2025-12-04T10:18:57.3560833Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_10==0.0) (1.3.0) 2025-12-04T10:18:57.3609071Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_10==0.0) (3.0.3) 2025-12-04T10:18:57.3617858Z Building wheels for collected packages: libtorch_agnostic_2_10 2025-12-04T10:19:03.1330504Z Building wheel for libtorch_agnostic_2_10 (pyproject.toml) ... done 2025-12-04T10:19:03.1339891Z Created wheel for libtorch_agnostic_2_10: filename=libtorch_agnostic_2_10-0.0-cp39-abi3-linux_x86_64.whl size=83393 sha256=a9f15197cd0259e70c8d0b49b5329887c0a39516f9503ff2f52a2a389ac3d275 2025-12-04T10:19:03.1341650Z Stored in directory: /tmp/pip-ephem-wheel-cache-ycqpxopc/wheels/03/17/c4/d9b9dbd12b271a9a317a75e944d0966701385d67eac86f2c1a 2025-12-04T10:19:03.1360715Z Successfully built libtorch_agnostic_2_10 2025-12-04T10:19:03.4091948Z Installing collected packages: libtorch_agnostic_2_10 2025-12-04T10:19:03.4246773Z Successfully installed libtorch_agnostic_2_10-0.0 2025-12-04T10:19:03.4654996Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:19:03.4659012Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_aot_ninja.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ...
[2025-12-04 10:19:03.465579] 2025-12-04T10:19:06.6300467Z 2025-12-04T10:19:06.6301726Z test_cpp_extensions_aot_ninja 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_aot_ninja_1.1_26fdd18b2d333d72_.log 2025-12-04T10:19:06.6302538Z Running 0 items in this shard: 2025-12-04T10:19:06.6302715Z 2025-12-04T10:19:06.6303012Z Finished test_cpp_extensions_aot_ninja 1/1 ... [2025-12-04 10:19:06.629891][2423.264365324], took 1.94min 2025-12-04T10:19:06.6378933Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-7e3d25a89f42eb08.xml 2025-12-04T10:19:06.7102172Z Running test_cpp_extensions_aot_no_ninja 1/1 ... [2025-12-04 10:19:06.709964][2423.344442] 2025-12-04T10:19:07.0198535Z Processing /var/lib/jenkins/workspace/test/cpp_extensions 2025-12-04T10:19:09.8580411Z Preparing metadata (pyproject.toml) ... done 2025-12-04T10:19:09.8602420Z Building wheels for collected packages: torch_test_cpp_extension 2025-12-04T10:20:23.8381555Z Building wheel for torch_test_cpp_extension (pyproject.toml) ...
done 2025-12-04T10:20:23.8493830Z Created wheel for torch_test_cpp_extension: filename=torch_test_cpp_extension-0.0.0-cp310-cp310-linux_x86_64.whl size=13199635 sha256=982956e6b55ca0763c94ac200692ebbe3d7e8ccd4d58ca28b38ff2884b8c72a6 2025-12-04T10:20:23.8496325Z Stored in directory: /tmp/pip-ephem-wheel-cache-s03a098o/wheels/2b/79/8d/635cf291e138cfea331292ca746c62b61fade208eb55a7e3a1 2025-12-04T10:20:23.8513530Z Successfully built torch_test_cpp_extension 2025-12-04T10:20:24.1598876Z Installing collected packages: torch_test_cpp_extension 2025-12-04T10:20:24.3483092Z Successfully installed torch_test_cpp_extension-0.0.0 2025-12-04T10:20:24.6861368Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/no_python_abi_suffix_test 2025-12-04T10:20:26.3745780Z Preparing metadata (pyproject.toml) ... done 2025-12-04T10:20:26.3767715Z Building wheels for collected packages: no_python_abi_suffix_test 2025-12-04T10:20:28.2440068Z Building wheel for no_python_abi_suffix_test (pyproject.toml) ... done 2025-12-04T10:20:28.2448253Z Created wheel for no_python_abi_suffix_test: filename=no_python_abi_suffix_test-0.0.0-cp310-cp310-linux_x86_64.whl size=2944 sha256=e36d2095b8c71fa1c91be0f3c98219cdd4560a52982f1cb170b468151e38265b 2025-12-04T10:20:28.2449619Z Stored in directory: /tmp/pip-ephem-wheel-cache-4wp37ifk/wheels/8c/c7/11/bcf2bfbdebb3cf78b8211ac54acc945a8fdf1732548d147a80 2025-12-04T10:20:28.2468434Z Successfully built no_python_abi_suffix_test 2025-12-04T10:20:28.5479679Z Installing collected packages: no_python_abi_suffix_test 2025-12-04T10:20:28.5619326Z Successfully installed no_python_abi_suffix_test-0.0.0 2025-12-04T10:20:28.6494325Z * Getting build dependencies for wheel...
2025-12-04T10:20:30.0226644Z running egg_info 2025-12-04T10:20:30.0302778Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T10:20:30.0306479Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T10:20:30.0309664Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T10:20:30.0733320Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:20:30.0741499Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:20:30.3681843Z * Building wheel... 2025-12-04T10:20:31.7209662Z running bdist_wheel 2025-12-04T10:20:31.7787067Z running build 2025-12-04T10:20:31.7787465Z running build_ext 2025-12-04T10:20:31.7820358Z building 'python_agnostic._C' extension 2025-12-04T10:20:31.8484256Z ninja: no work to do. 2025-12-04T10:20:31.8520881Z g++ -pthread -B /opt/conda/envs/py_3.10/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -fPIC -O2 -isystem /opt/conda/envs/py_3.10/include -pthread -B /opt/conda/envs/py_3.10/compiler_compat -shared /var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/build/temp.linux-x86_64-cpython-310/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/python_agnostic/csrc/ultra_norm.o -L/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib -L/usr/local/cuda/lib64 -lc10 -ltorch -ltorch_cpu -lcudart -lc10_cuda -ltorch_cuda -o build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so 2025-12-04T10:20:32.5190792Z installing to build/bdist.linux-x86_64/wheel 2025-12-04T10:20:32.5191178Z running install 2025-12-04T10:20:32.5229244Z running install_lib 2025-12-04T10:20:32.5304092Z creating build/bdist.linux-x86_64/wheel 2025-12-04T10:20:32.5305949Z creating build/bdist.linux-x86_64/wheel/python_agnostic 2025-12-04T10:20:32.5307389Z copying build/lib.linux-x86_64-cpython-310/python_agnostic/_C.so -> 
build/bdist.linux-x86_64/wheel/./python_agnostic 2025-12-04T10:20:32.5312876Z running install_egg_info 2025-12-04T10:20:32.5387186Z running egg_info 2025-12-04T10:20:32.5455065Z writing python_agnostic.egg-info/PKG-INFO 2025-12-04T10:20:32.5459210Z writing dependency_links to python_agnostic.egg-info/dependency_links.txt 2025-12-04T10:20:32.5462356Z writing top-level names to python_agnostic.egg-info/top_level.txt 2025-12-04T10:20:32.5535082Z reading manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:20:32.5543432Z writing manifest file 'python_agnostic.egg-info/SOURCES.txt' 2025-12-04T10:20:32.5545228Z Copying python_agnostic.egg-info to build/bdist.linux-x86_64/wheel/./python_agnostic-0.0-py3.10.egg-info 2025-12-04T10:20:32.5552452Z running install_scripts 2025-12-04T10:20:32.5660091Z creating build/bdist.linux-x86_64/wheel/python_agnostic-0.0.dist-info/WHEEL 2025-12-04T10:20:32.5664015Z creating '/var/lib/jenkins/workspace/test/cpp_extensions/python_agnostic_extension/dist/.tmp-owzpev2p/python_agnostic-0.0-cp39-abi3-linux_x86_64.whl' and adding 'build/bdist.linux-x86_64/wheel' to it 2025-12-04T10:20:32.5819427Z adding 'python_agnostic/_C.so' 2025-12-04T10:20:32.5825271Z adding 'python_agnostic-0.0.dist-info/METADATA' 2025-12-04T10:20:32.5826378Z adding 'python_agnostic-0.0.dist-info/WHEEL' 2025-12-04T10:20:32.5827758Z adding 'python_agnostic-0.0.dist-info/top_level.txt' 2025-12-04T10:20:32.5828942Z adding 'python_agnostic-0.0.dist-info/RECORD' 2025-12-04T10:20:32.5829568Z removing build/bdist.linux-x86_64/wheel 2025-12-04T10:20:32.8367957Z Successfully built python_agnostic-0.0-cp39-abi3-linux_x86_64.whl 2025-12-04T10:20:33.1446198Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_9_extension 2025-12-04T10:20:34.8499529Z Preparing metadata (pyproject.toml) ... 
done 2025-12-04T10:20:34.8523212Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_9==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T10:20:34.8546302Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.18.0) 2025-12-04T10:20:34.8551025Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (4.12.2) 2025-12-04T10:20:34.8555748Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (1.13.3) 2025-12-04T10:20:34.8559793Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2.8.8) 2025-12-04T10:20:34.8562804Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (3.1.6) 2025-12-04T10:20:34.8566985Z Requirement already satisfied: fsspec>=0.8.5 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_9==0.0) (2025.10.0) 2025-12-04T10:20:34.8894510Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_9==0.0) (1.3.0) 2025-12-04T10:20:34.8942343Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_9==0.0) (3.0.3) 2025-12-04T10:20:34.8950844Z Building wheels for collected packages: libtorch_agnostic_2_9 2025-12-04T10:20:37.0402627Z Building wheel for libtorch_agnostic_2_9 (pyproject.toml) ...
done 2025-12-04T10:20:37.0411421Z Created wheel for libtorch_agnostic_2_9: filename=libtorch_agnostic_2_9-0.0-cp39-abi3-linux_x86_64.whl size=55939 sha256=f1ddcf45bae02d41e5b856b198822386e3f9ad32f25e94f642992b08738735b2 2025-12-04T10:20:37.0412577Z Stored in directory: /tmp/pip-ephem-wheel-cache-i6zb3evu/wheels/e1/56/0d/91ac1e918c8015b48f6a77f66abeeb8427a8788f7d37715e0e 2025-12-04T10:20:37.0430230Z Successfully built libtorch_agnostic_2_9 2025-12-04T10:20:37.3113710Z Installing collected packages: libtorch_agnostic_2_9 2025-12-04T10:20:37.3265110Z Successfully installed libtorch_agnostic_2_9-0.0 2025-12-04T10:20:37.6604290Z Processing /var/lib/jenkins/workspace/test/cpp_extensions/libtorch_agnostic_2_10_extension 2025-12-04T10:20:39.3570989Z Preparing metadata (pyproject.toml) ... done 2025-12-04T10:20:39.3594031Z Requirement already satisfied: torch in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from libtorch_agnostic_2_10==0.0) (2.10.0a0+gitffd9b0f) 2025-12-04T10:20:39.3617364Z Requirement already satisfied: filelock in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.18.0) 2025-12-04T10:20:39.3622191Z Requirement already satisfied: typing-extensions>=4.10.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (4.12.2) 2025-12-04T10:20:39.3626194Z Requirement already satisfied: sympy>=1.13.3 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (1.13.3) 2025-12-04T10:20:39.3630300Z Requirement already satisfied: networkx>=2.5.1 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2.8.8) 2025-12-04T10:20:39.3633304Z Requirement already satisfied: jinja2 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (3.1.6) 2025-12-04T10:20:39.3637325Z Requirement already satisfied: fsspec>=0.8.5 in
/opt/conda/envs/py_3.10/lib/python3.10/site-packages (from torch->libtorch_agnostic_2_10==0.0) (2025.10.0) 2025-12-04T10:20:39.3975079Z Requirement already satisfied: mpmath<1.4,>=1.1.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from sympy>=1.13.3->torch->libtorch_agnostic_2_10==0.0) (1.3.0) 2025-12-04T10:20:39.4023871Z Requirement already satisfied: MarkupSafe>=2.0 in /opt/conda/envs/py_3.10/lib/python3.10/site-packages (from jinja2->torch->libtorch_agnostic_2_10==0.0) (3.0.3) 2025-12-04T10:20:39.4032758Z Building wheels for collected packages: libtorch_agnostic_2_10 2025-12-04T10:20:41.8845426Z Building wheel for libtorch_agnostic_2_10 (pyproject.toml) ... done 2025-12-04T10:20:41.8854870Z Created wheel for libtorch_agnostic_2_10: filename=libtorch_agnostic_2_10-0.0-cp39-abi3-linux_x86_64.whl size=83393 sha256=075fe8b4bc244fea7a76fa81e4a6617586a59f70abeea9f7c4714240cf01a74d 2025-12-04T10:20:41.8856746Z Stored in directory: /tmp/pip-ephem-wheel-cache-yzejm40l/wheels/03/17/c4/d9b9dbd12b271a9a317a75e944d0966701385d67eac86f2c1a 2025-12-04T10:20:41.8873246Z Successfully built libtorch_agnostic_2_10 2025-12-04T10:20:42.1635826Z Installing collected packages: libtorch_agnostic_2_10 2025-12-04T10:20:42.1767057Z Successfully installed libtorch_agnostic_2_10-0.0 2025-12-04T10:20:42.2199765Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:20:42.2203977Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'test_cpp_extensions_aot_no_ninja.py', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ...
[2025-12-04 10:20:42.220064] 2025-12-04T10:20:45.3762881Z 2025-12-04T10:20:45.3763810Z test_cpp_extensions_aot_no_ninja 1/1 was successful, full logs can be found in artifacts with path test/test-reports/test_cpp_extensions_aot_no_ninja_1.1_066b94fa818468e5_.log 2025-12-04T10:20:45.3764702Z Running 0 items in this shard: 2025-12-04T10:20:45.3764897Z 2025-12-04T10:20:45.3765184Z Finished test_cpp_extensions_aot_no_ninja 1/1 ... [2025-12-04 10:20:45.376110][2522.010583685], took 1.64min 2025-12-04T10:20:45.3841125Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-b5fa4f992440af7c.xml 2025-12-04T10:20:45.4880222Z Running inductor/test_collective_autotuning 1/1 ... [2025-12-04 10:20:45.487748][2522.122225144] 2025-12-04T10:20:45.4880754Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:20:45.4883937Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_collective_autotuning.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:20:45.488081] 2025-12-04T10:20:48.2794080Z 2025-12-04T10:20:48.2795273Z inductor/test_collective_autotuning 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_collective_autotuning_1.1_550d9541b7e790e0_.log 2025-12-04T10:20:48.2796095Z Running 0 items in this shard: 2025-12-04T10:20:48.2796275Z 2025-12-04T10:20:48.2796589Z Finished inductor/test_collective_autotuning 1/1 ... 
[2025-12-04 10:20:48.279174][2524.913647483], took 0.05min 2025-12-04T10:20:48.2872589Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-26de2e09af9201ce.xml 2025-12-04T10:20:48.3142604Z Running inductor/test_halide 1/1 ... [2025-12-04 10:20:48.314024][2524.948499979] 2025-12-04T10:20:48.3143033Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:20:48.3146216Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_halide.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:20:48.314362] 2025-12-04T10:20:54.1421686Z 2025-12-04T10:20:54.1422497Z inductor/test_halide 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_halide_1.1_1cd6f628b6d78e53_.log 2025-12-04T10:20:54.1423092Z 2025-12-04T10:20:54.1423343Z Finished inductor/test_halide 1/1 ... [2025-12-04 10:20:54.141876][2530.776352586], took 0.10min 2025-12-04T10:20:54.1504543Z Running inductor/test_aot_inductor_utils 1/1 ... [2025-12-04 10:20:54.150180][2530.784658513] 2025-12-04T10:20:54.1505023Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:20:54.1508473Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor_utils.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:20:54.150535] 2025-12-04T10:20:59.6098128Z 2025-12-04T10:20:59.6099074Z inductor/test_aot_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_utils_1.1_64c4865198c449ee_.log 2025-12-04T10:20:59.6099861Z Running 0 items in this shard: 2025-12-04T10:20:59.6100040Z 2025-12-04T10:20:59.6100343Z Finished inductor/test_aot_inductor_utils 1/1 ... [2025-12-04 10:20:59.609601][2536.24407723], took 0.09min 2025-12-04T10:20:59.6177485Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-6741be3d1dc90f7c.xml 2025-12-04T10:20:59.6920063Z Running dynamo/test_graph_region_tracker 1/1 ... [2025-12-04 10:20:59.691742][2536.326220927] 2025-12-04T10:20:59.6920571Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:20:59.6923526Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_graph_region_tracker.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:20:59.692059] 2025-12-04T10:21:02.7941016Z 2025-12-04T10:21:02.7942061Z dynamo/test_graph_region_tracker 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_graph_region_tracker_1.1_9fc1ac46ca6a3092_.log 2025-12-04T10:21:02.7942878Z Running 0 items in this shard: 2025-12-04T10:21:02.7943058Z 2025-12-04T10:21:02.7943359Z Finished dynamo/test_graph_region_tracker 1/1 ... [2025-12-04 10:21:02.793853][2539.428326383], took 0.05min 2025-12-04T10:21:02.8023156Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-29cd727ff584e69e.xml 2025-12-04T10:21:02.8298597Z Running dynamo/test_unittest 1/1 ... 
[2025-12-04 10:21:02.829584][2539.464061528] 2025-12-04T10:21:02.8299491Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:02.8302569Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_unittest.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:02.829936] 2025-12-04T10:21:05.8999249Z 2025-12-04T10:21:05.9000246Z dynamo/test_unittest 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_unittest_1.1_5b6ad9c3fddc9671_.log 2025-12-04T10:21:05.9001241Z Running 0 items in this shard: 2025-12-04T10:21:05.9001481Z 2025-12-04T10:21:05.9001867Z Finished dynamo/test_unittest 1/1 ... [2025-12-04 10:21:05.899688][2542.534162355], took 0.05min 2025-12-04T10:21:05.9084556Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-2dedd17aa5d99b38.xml 2025-12-04T10:21:05.9329705Z Running inductor/test_compile 1/1 ... [2025-12-04 10:21:05.932738][2542.567215343] 2025-12-04T10:21:05.9330295Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:05.9333875Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compile.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:05.933091] 2025-12-04T10:21:11.3993909Z 2025-12-04T10:21:11.3994778Z inductor/test_compile 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_compile_1.1_4a12187e152c59f0_.log 2025-12-04T10:21:11.3995491Z Running 0 items in this shard: 2025-12-04T10:21:11.3995661Z 2025-12-04T10:21:11.3996191Z Finished inductor/test_compile 1/1 ... 
[2025-12-04 10:21:11.399148][2548.033624869], took 0.09min 2025-12-04T10:21:11.4079401Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-1f9b47dff76a97ed.xml 2025-12-04T10:21:11.4373378Z Running dynamo/test_functions 1/1 ... [2025-12-04 10:21:11.437102][2548.071581312] 2025-12-04T10:21:11.4373817Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:11.4377043Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_functions.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:11.437406] 2025-12-04T10:21:17.6350704Z 2025-12-04T10:21:17.6351805Z dynamo/test_functions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_functions_1.1_330f01649095c7d8_.log 2025-12-04T10:21:17.6352604Z Running 0 items in this shard: 2025-12-04T10:21:17.6352910Z 2025-12-04T10:21:17.6353186Z Finished dynamo/test_functions 1/1 ... [2025-12-04 10:21:17.634840][2554.269316527], took 0.10min 2025-12-04T10:21:17.6437089Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-abfa29da9b6f3fb3.xml 2025-12-04T10:21:17.7164681Z Running inductor/test_ordered_set 1/1 ... [2025-12-04 10:21:17.716189][2554.350667879] 2025-12-04T10:21:17.7165130Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:17.7167960Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_ordered_set.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:21:17.716501] 2025-12-04T10:21:21.0972617Z 2025-12-04T10:21:21.0973699Z inductor/test_ordered_set 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_ordered_set_1.1_0d7f7a7fdedccd6f_.log 2025-12-04T10:21:21.0974784Z Running 0 items in this shard: 2025-12-04T10:21:21.0974964Z 2025-12-04T10:21:21.0975238Z Finished inductor/test_ordered_set 1/1 ... [2025-12-04 10:21:21.097021][2557.731497599], took 0.06min 2025-12-04T10:21:21.1059279Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-d135623a87a3c057.xml 2025-12-04T10:21:21.1349437Z Running dynamo/test_install_free_tensors 1/1 ... [2025-12-04 10:21:21.134698][2557.769176606] 2025-12-04T10:21:21.1349910Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:21.1352744Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_install_free_tensors.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:21.135014] 2025-12-04T10:21:24.2483223Z 2025-12-04T10:21:24.2484030Z dynamo/test_install_free_tensors 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_install_free_tensors_1.1_afcc566d5ba882b3_.log 2025-12-04T10:21:24.2484670Z Running 0 items in this shard: 2025-12-04T10:21:24.2484813Z 2025-12-04T10:21:24.2485047Z Finished dynamo/test_install_free_tensors 1/1 ... [2025-12-04 10:21:24.248089][2560.882564442], took 0.05min 2025-12-04T10:21:24.2572084Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-b1819f6ae3648480.xml 2025-12-04T10:21:24.2866812Z Running inductor/test_torchinductor_codegen_config_overrides 1/1 ... 
[2025-12-04 10:21:24.286391][2560.920868492] 2025-12-04T10:21:24.2867483Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:24.2870517Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_codegen_config_overrides.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:24.286699] 2025-12-04T10:21:29.7305529Z 2025-12-04T10:21:29.7306622Z inductor/test_torchinductor_codegen_config_overrides 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_codegen_config_overrides_1.1_7ffeab4b4f5448ff_.log 2025-12-04T10:21:29.7307726Z Running 0 items in this shard: 2025-12-04T10:21:29.7307902Z 2025-12-04T10:21:29.7308582Z Finished inductor/test_torchinductor_codegen_config_overrides 1/1 ... [2025-12-04 10:21:29.730325][2566.364799534], took 0.09min 2025-12-04T10:21:29.7396096Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9553b7353fc11e83.xml 2025-12-04T10:21:29.7695493Z Running export/test_passes 1/1 ... [2025-12-04 10:21:29.769304][2566.403781588] 2025-12-04T10:21:29.7695938Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:29.7698867Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_passes.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:21:29.769616] 2025-12-04T10:21:33.7035497Z 2025-12-04T10:21:33.7036425Z export/test_passes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_passes_1.1_9a890949cbdad883_.log 2025-12-04T10:21:33.7037290Z Running 0 items in this shard: 2025-12-04T10:21:33.7037502Z 2025-12-04T10:21:33.7037797Z Finished export/test_passes 1/1 ... [2025-12-04 10:21:33.703310][2570.337784253], took 0.07min 2025-12-04T10:21:33.7127993Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/export.test_passes/export.test_passes-6a90abd3a745b76d.xml 2025-12-04T10:21:33.7480477Z Running dynamo/test_autograd_function 1/1 ... [2025-12-04 10:21:33.747791][2570.382268756] 2025-12-04T10:21:33.7480948Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:33.7483997Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_autograd_function.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:33.748103] 2025-12-04T10:21:39.2624059Z 2025-12-04T10:21:39.2624890Z dynamo/test_autograd_function 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_autograd_function_1.1_bef268133c355af5_.log 2025-12-04T10:21:39.2625527Z Running 0 items in this shard: 2025-12-04T10:21:39.2625679Z 2025-12-04T10:21:39.2625918Z Finished dynamo/test_autograd_function 1/1 ... [2025-12-04 10:21:39.262176][2575.896653701], took 0.09min 2025-12-04T10:21:39.2721364Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-00c0455c5def0d5c.xml 2025-12-04T10:21:39.3012649Z Running inductor/test_codecache 1/1 ... 
[2025-12-04 10:21:39.301017][2575.935495972] 2025-12-04T10:21:39.3013086Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:39.3015983Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_codecache.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:39.301325] 2025-12-04T10:21:45.2345032Z 2025-12-04T10:21:45.2346262Z inductor/test_codecache 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_codecache_1.1_fbe410a98ef19d73_.log 2025-12-04T10:21:45.2347078Z Running 0 items in this shard: 2025-12-04T10:21:45.2347378Z 2025-12-04T10:21:45.2347679Z Finished inductor/test_codecache 1/1 ... [2025-12-04 10:21:45.234233][2581.868706512], took 0.10min 2025-12-04T10:21:45.2442019Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-0a4f995bcf28ccb9.xml 2025-12-04T10:21:45.2712086Z Running complex_tensor/test_complex_tensor 2/3 ... [2025-12-04 10:21:45.270961][2581.905438467] 2025-12-04T10:21:45.2712574Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:45.2716006Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'complex_tensor/test_complex_tensor.py', '-m', 'serial', '--shard-id=2', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:21:45.271292] 2025-12-04T10:21:50.2164826Z 2025-12-04T10:21:50.2165933Z complex_tensor/test_complex_tensor 2/3 was successful, full logs can be found in artifacts with path test/test-reports/complex_tensor.test_complex_tensor_2.3_7c46d523192cf8e5_.log 2025-12-04T10:21:50.2166729Z Running 0 items in this shard: 2025-12-04T10:21:50.2166909Z 2025-12-04T10:21:50.2167210Z Finished complex_tensor/test_complex_tensor 2/3 ... [2025-12-04 10:21:50.216217][2586.850692416], took 0.08min 2025-12-04T10:21:50.2262637Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-7f6e01a72670401d.xml 2025-12-04T10:21:50.2574180Z Running optim/test_lrscheduler 1/1 ... [2025-12-04 10:21:50.257204][2586.891681532] 2025-12-04T10:21:50.2574621Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:50.2578163Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_lrscheduler.py', '-m', 'serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:50.257550] 2025-12-04T10:21:53.0188385Z 2025-12-04T10:21:53.0189291Z optim/test_lrscheduler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_lrscheduler_1.1_33ec11e4104f54ed_.log 2025-12-04T10:21:53.0189933Z 2025-12-04T10:21:53.0190210Z Finished optim/test_lrscheduler 1/1 ... [2025-12-04 10:21:53.018591][2589.65306592], took 0.05min 2025-12-04T10:21:56.0282699Z Running inductor/test_collective_autotuning 1/1 ... 
[2025-12-04 10:21:56.027788][2592.662262466] 2025-12-04T10:21:56.0283263Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:56.0285806Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_collective_autotuning.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:56.028242] 2025-12-04T10:21:56.0636471Z Running inductor/test_halide 1/1 ... [2025-12-04 10:21:56.063246][2592.697721094] 2025-12-04T10:21:56.0636922Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:56.0640817Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_halide.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:21:56.063692] 2025-12-04T10:21:56.0642374Z Running inductor/test_aot_inductor_utils 1/1 ... [2025-12-04 10:21:56.063924][2592.698399509] 2025-12-04T10:21:56.0642830Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:21:56.0646345Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_aot_inductor_utils.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:21:56.064326] 2025-12-04T10:21:59.3056260Z 2025-12-04T10:21:59.3057431Z inductor/test_collective_autotuning 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_collective_autotuning_1.1_1c394bc2a0574cb0_.log 2025-12-04T10:21:59.3058252Z Running 0 items in this shard: 2025-12-04T10:21:59.3058427Z 2025-12-04T10:21:59.3058760Z Finished inductor/test_collective_autotuning 1/1 ... [2025-12-04 10:21:59.305447][2595.939921291], took 0.05min 2025-12-04T10:21:59.3146105Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-d51a031c86e0e3ba.xml 2025-12-04T10:22:02.5092214Z 2025-12-04T10:22:02.5093378Z inductor/test_halide 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_halide_1.1_58d4617b51948353_.log 2025-12-04T10:22:02.5094271Z 2025-12-04T10:22:02.5094524Z Finished inductor/test_halide 1/1 ... [2025-12-04 10:22:02.509082][2599.143555074], took 0.11min 2025-12-04T10:22:03.0290210Z Running dynamo/test_graph_region_tracker 1/1 ... [2025-12-04 10:22:03.028522][2599.66299564] 2025-12-04T10:22:03.0291054Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:03.0293115Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_graph_region_tracker.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... 
[2025-12-04 10:22:03.028991] 2025-12-04T10:22:04.0907603Z 2025-12-04T10:22:04.0908544Z inductor/test_aot_inductor_utils 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_aot_inductor_utils_1.1_4806681524bae620_.log 2025-12-04T10:22:04.0909682Z Running 0 items in this shard: 2025-12-04T10:22:04.0909852Z 2025-12-04T10:22:04.0910141Z Finished inductor/test_aot_inductor_utils 1/1 ... [2025-12-04 10:22:04.090634][2600.725108328], took 0.13min 2025-12-04T10:22:04.0999880Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-ee8350aedee45242.xml 2025-12-04T10:22:06.3087223Z Running dynamo/test_unittest 1/1 ... [2025-12-04 10:22:06.308213][2602.942686598] 2025-12-04T10:22:06.3087718Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:06.3090198Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_unittest.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:06.308704] 2025-12-04T10:22:06.6817107Z 2025-12-04T10:22:06.6818269Z dynamo/test_graph_region_tracker 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_graph_region_tracker_1.1_9d09536ee8be792f_.log 2025-12-04T10:22:06.6819114Z Running 0 items in this shard: 2025-12-04T10:22:06.6819287Z 2025-12-04T10:22:06.6819581Z Finished dynamo/test_graph_region_tracker 1/1 ... [2025-12-04 10:22:06.681541][2603.316018201], took 0.06min 2025-12-04T10:22:06.6958362Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-06f3ef79430d8b50.xml 2025-12-04T10:22:07.7449493Z Running inductor/test_compile 1/1 ... 
[2025-12-04 10:22:07.744482][2604.378956587] 2025-12-04T10:22:07.7450007Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:07.7454138Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_compile.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:07.744989] 2025-12-04T10:22:09.9584288Z 2025-12-04T10:22:09.9585166Z dynamo/test_unittest 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_unittest_1.1_c9f1cab89a7c66a5_.log 2025-12-04T10:22:09.9585858Z Running 0 items in this shard: 2025-12-04T10:22:09.9586033Z 2025-12-04T10:22:09.9586287Z Finished dynamo/test_unittest 1/1 ... [2025-12-04 10:22:09.958247][2606.592723328], took 0.06min 2025-12-04T10:22:09.9681994Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-8729607d2891c907.xml 2025-12-04T10:22:10.4006066Z Running dynamo/test_functions 1/1 ... [2025-12-04 10:22:10.400156][2607.034630199] 2025-12-04T10:22:10.4006552Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:10.4008642Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_functions.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:10.400562] 2025-12-04T10:22:13.6645969Z Running inductor/test_ordered_set 1/1 ... 
[2025-12-04 10:22:13.664146][2610.29862032] 2025-12-04T10:22:13.6646692Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:13.6649182Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_ordered_set.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:13.664575] 2025-12-04T10:22:14.0171507Z 2025-12-04T10:22:14.0172682Z inductor/test_compile 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_compile_1.1_a8a66c5b22feb377_.log 2025-12-04T10:22:14.0173915Z Running 0 items in this shard: 2025-12-04T10:22:14.0174154Z 2025-12-04T10:22:14.0174471Z Finished inductor/test_compile 1/1 ... [2025-12-04 10:22:14.016976][2610.651454019], took 0.10min 2025-12-04T10:22:14.0272513Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-de6ee8a8e08eed81.xml 2025-12-04T10:22:17.4610565Z 2025-12-04T10:22:17.4611471Z dynamo/test_functions 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_functions_1.1_f1b0f3ce8ba833d3_.log 2025-12-04T10:22:17.4612074Z Running 0 items in this shard: 2025-12-04T10:22:17.4612215Z 2025-12-04T10:22:17.4612450Z Finished dynamo/test_functions 1/1 ... 
[2025-12-04 10:22:17.460651][2614.095127204], took 0.12min 2025-12-04T10:22:17.4711911Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-bf5934b50d848f7f.xml 2025-12-04T10:22:17.5421283Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-cd0f012a3cd8e7fd.xml 2025-12-04T10:22:17.7242247Z 2025-12-04T10:22:17.7243410Z inductor/test_ordered_set 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_ordered_set_1.1_49f52951685697ab_.log 2025-12-04T10:22:17.7244197Z Running 0 items in this shard: 2025-12-04T10:22:17.7244373Z 2025-12-04T10:22:17.7244655Z Finished inductor/test_ordered_set 1/1 ... [2025-12-04 10:22:17.724106][2614.358578087], took 0.07min 2025-12-04T10:22:17.7914525Z Running dynamo/test_install_free_tensors 1/1 ... [2025-12-04 10:22:17.791028][2614.425500225] 2025-12-04T10:22:17.7915028Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:17.7918740Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_install_free_tensors.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:17.791510] 2025-12-04T10:22:21.1515853Z Running inductor/test_torchinductor_codegen_config_overrides 1/1 ... 
[2025-12-04 10:22:21.151043][2617.785516866] 2025-12-04T10:22:21.1516830Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:21.1518368Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_torchinductor_codegen_config_overrides.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:21.151460] 2025-12-04T10:22:21.4541061Z 2025-12-04T10:22:21.4542323Z dynamo/test_install_free_tensors 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_install_free_tensors_1.1_7d2bd3620c394d13_.log 2025-12-04T10:22:21.4543491Z Running 0 items in this shard: 2025-12-04T10:22:21.4543665Z 2025-12-04T10:22:21.4543957Z Finished dynamo/test_install_free_tensors 1/1 ... [2025-12-04 10:22:21.453964][2618.088436331], took 0.06min 2025-12-04T10:22:21.4650826Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-59f3a298f57a1d82.xml 2025-12-04T10:22:21.5197435Z Running export/test_passes 1/1 ... [2025-12-04 10:22:21.519367][2618.153842106] 2025-12-04T10:22:21.5197884Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:21.5201262Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'export/test_passes.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:21.519818] 2025-12-04T10:22:25.1345736Z Running dynamo/test_autograd_function 1/1 ... 
[2025-12-04 10:22:25.134082][2621.768556202] 2025-12-04T10:22:25.1346348Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:25.1348478Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'dynamo/test_autograd_function.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:25.134504] 2025-12-04T10:22:26.1680535Z 2025-12-04T10:22:26.1681440Z export/test_passes 1/1 was successful, full logs can be found in artifacts with path test/test-reports/export.test_passes_1.1_06978b298f6392da_.log 2025-12-04T10:22:26.1682171Z Running 0 items in this shard: 2025-12-04T10:22:26.1682340Z 2025-12-04T10:22:26.1682618Z Finished export/test_passes 1/1 ... [2025-12-04 10:22:26.167913][2622.802389346], took 0.08min 2025-12-04T10:22:26.1862975Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/export.test_passes/export.test_passes-3e908af53b230225.xml 2025-12-04T10:22:27.4432055Z 2025-12-04T10:22:27.4433955Z inductor/test_torchinductor_codegen_config_overrides 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_torchinductor_codegen_config_overrides_1.1_d73cc8b4d15ded76_.log 2025-12-04T10:22:27.4435251Z Running 0 items in this shard: 2025-12-04T10:22:27.4435480Z 2025-12-04T10:22:27.4436005Z Finished inductor/test_torchinductor_codegen_config_overrides 1/1 ... [2025-12-04 10:22:27.443042][2624.07751835], took 0.10min 2025-12-04T10:22:27.4579109Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9b006bc55755cf8f.xml 2025-12-04T10:22:29.8974780Z Running inductor/test_codecache 1/1 ... 
[2025-12-04 10:22:29.897005][2626.531480482] 2025-12-04T10:22:29.8975331Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:29.8977253Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'inductor/test_codecache.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:29.897425] 2025-12-04T10:22:31.1652672Z Running complex_tensor/test_complex_tensor 2/3 ... [2025-12-04 10:22:31.164768][2627.79924132] 2025-12-04T10:22:31.1653198Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:31.1655565Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'complex_tensor/test_complex_tensor.py', '-m', 'not serial', '--shard-id=2', '--num-shards=3', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:31.165201] 2025-12-04T10:22:31.6014875Z 2025-12-04T10:22:31.6015954Z dynamo/test_autograd_function 1/1 was successful, full logs can be found in artifacts with path test/test-reports/dynamo.test_autograd_function_1.1_d50170967ccb6cc8_.log 2025-12-04T10:22:31.6017011Z Running 0 items in this shard: 2025-12-04T10:22:31.6017294Z 2025-12-04T10:22:31.6017762Z Finished dynamo/test_autograd_function 1/1 ... [2025-12-04 10:22:31.601366][2628.235842047], took 0.11min 2025-12-04T10:22:31.6130671Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-4ac3d5673f9a4827.xml 2025-12-04T10:22:35.3289537Z Running optim/test_lrscheduler 1/1 ... 
[2025-12-04 10:22:35.328501][2631.962976011] 2025-12-04T10:22:35.3290252Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:35.3293263Z Executing ['/opt/conda/envs/py_3.10/bin/python', '-bb', 'optim/test_lrscheduler.py', '-m', 'not serial', '--shard-id=1', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--flake-finder', '--flake-runs=50', '--import-slow-tests', '--import-disabled-tests', '--rerun-disabled-tests'] ... [2025-12-04 10:22:35.328966] 2025-12-04T10:22:36.5994634Z 2025-12-04T10:22:36.5995772Z complex_tensor/test_complex_tensor 2/3 was successful, full logs can be found in artifacts with path test/test-reports/complex_tensor.test_complex_tensor_2.3_8c97df55eaaa8b55_.log 2025-12-04T10:22:36.5996722Z Running 0 items in this shard: 2025-12-04T10:22:36.5996946Z 2025-12-04T10:22:36.5997194Z Finished complex_tensor/test_complex_tensor 2/3 ... [2025-12-04 10:22:36.599399][2633.233872256], took 0.09min 2025-12-04T10:22:36.6130780Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-5b172dd2c0b9882d.xml 2025-12-04T10:22:36.6483242Z Parsing testcases for test report: /var/lib/jenkins/workspace/test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-57b4d0d1903d84ca.xml 2025-12-04T10:22:36.7647003Z 2025-12-04T10:22:36.7648103Z inductor/test_codecache 1/1 was successful, full logs can be found in artifacts with path test/test-reports/inductor.test_codecache_1.1_83e4acb8e97ccfe4_.log 2025-12-04T10:22:36.7648847Z Running 0 items in this shard: 2025-12-04T10:22:36.7649026Z 2025-12-04T10:22:36.7649289Z Finished inductor/test_codecache 1/1 ... 
[2025-12-04 10:22:36.764558][2633.39903123], took 0.11min 2025-12-04T10:22:38.5878479Z 2025-12-04T10:22:38.5879449Z optim/test_lrscheduler 1/1 was successful, full logs can be found in artifacts with path test/test-reports/optim.test_lrscheduler_1.1_5edfc3f4cf508994_.log 2025-12-04T10:22:38.5880087Z 2025-12-04T10:22:38.5880388Z Finished optim/test_lrscheduler 1/1 ... [2025-12-04 10:22:38.587707][2635.22218224], took 0.05min 2025-12-04T10:22:41.1887754Z Running test batch 'tests to run' cost 1726.94 seconds 2025-12-04T10:22:41.1900005Z Emitting td_test_failure_stats_v2 2025-12-04T10:22:41.1903533Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a33f192d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3026141Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a33f192d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3037044Z Emitting td_test_failure_stats_v2 2025-12-04T10:22:41.3039509Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a45450ad0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3423615Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a45450ad0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3436884Z Emitting td_test_failure_stats_v2 2025-12-04T10:22:41.3438397Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a4b5dd2d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3778215Z Done! 
Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a4b5dd2d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.3791859Z Emitting td_test_failure_stats_v2 2025-12-04T10:22:41.3793454Z Writing 1 documents to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a50c812d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.4174685Z Done! Finish writing document to S3 ossci-raw-job-status/ossci_uploaded_metrics/td_test_failure_stats_v2_1764843761_2a50c812d0fb11f0b7d00242ac110002 2025-12-04T10:22:41.4176029Z inductor/test_flex_attention 1/6 failed! 2025-12-04T10:22:41.4176481Z test_ci_sanity_check_fail 1/1 failed! 2025-12-04T10:22:41.4176895Z test_torch 1/1 failed! 2025-12-04T10:22:41.4177246Z test_fake_tensor 1/1 failed! 2025-12-04T10:22:42.3787206Z 2025-12-04T10:22:42.3787749Z real 28m52.628s 2025-12-04T10:22:42.3787985Z user 54m41.187s 2025-12-04T10:22:42.3788155Z sys 7m51.173s 2025-12-04T10:22:42.3788314Z + assert_git_not_dirty 2025-12-04T10:22:42.3788613Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *rocm* ]] 2025-12-04T10:22:42.3789524Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *xla* ]] 2025-12-04T10:22:42.3790555Z ++ git status --porcelain 2025-12-04T10:22:42.3791579Z ++ grep -v '?? 
third_party' 2025-12-04T10:22:45.8452192Z ++ true 2025-12-04T10:22:45.8454599Z + git_status= 2025-12-04T10:22:45.8454933Z + [[ -n '' ]] 2025-12-04T10:22:45.8455160Z + test_aten 2025-12-04T10:22:45.8455682Z + echo 'Running ATen tests with pytorch lib' 2025-12-04T10:22:45.8456020Z Running ATen tests with pytorch lib 2025-12-04T10:22:45.8456275Z + [[ -n '' ]] 2025-12-04T10:22:45.8456512Z + echo 'Running test with the build folder' 2025-12-04T10:22:45.8456798Z Running test with the build folder 2025-12-04T10:22:45.8457045Z + TEST_BASE_DIR=build/bin 2025-12-04T10:22:45.8458088Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10d_cuda_test.so build/bin 2025-12-04T10:22:45.8473455Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libcaffe2_nvrtc.so build/bin 2025-12-04T10:22:45.8487487Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libmkldnn*' build/bin 2025-12-04T10:22:45.8503132Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libnccl*' build/bin 2025-12-04T10:22:45.8520850Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_nvshmem.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorchbind_test.so build/bin 2025-12-04T10:22:45.8534742Z + ls build/bin 2025-12-04T10:22:45.8555899Z BackoffTest 
2025-12-04T10:22:45.8556340Z CppSignature_test 2025-12-04T10:22:45.8556706Z Dict_test 2025-12-04T10:22:45.8557030Z Dimname_test 2025-12-04T10:22:45.8557346Z FileStoreTest 2025-12-04T10:22:45.8557672Z HashStoreTest 2025-12-04T10:22:45.8558010Z IListRef_test 2025-12-04T10:22:45.8558342Z KernelFunction_test 2025-12-04T10:22:45.8558728Z List_test 2025-12-04T10:22:45.8559096Z MaybeOwned_test 2025-12-04T10:22:45.8559500Z NamedTensor_test 2025-12-04T10:22:45.8559862Z ProcessGroupGlooAsyncTest 2025-12-04T10:22:45.8560176Z ProcessGroupGlooTest 2025-12-04T10:22:45.8560394Z ProcessGroupMPITest 2025-12-04T10:22:45.8560766Z ProcessGroupNCCLErrorsTest 2025-12-04T10:22:45.8560980Z ProcessGroupNCCLTest 2025-12-04T10:22:45.8561161Z StorageUtils_test 2025-12-04T10:22:45.8561332Z TCPStoreTest 2025-12-04T10:22:45.8561598Z apply_utils_test 2025-12-04T10:22:45.8561843Z atest 2025-12-04T10:22:45.8561996Z backend_fallback_test 2025-12-04T10:22:45.8562164Z basic 2025-12-04T10:22:45.8562298Z broadcast_test 2025-12-04T10:22:45.8562465Z c10_AllocatorConfig_test 2025-12-04T10:22:45.8562645Z c10_ArrayRef_test 2025-12-04T10:22:45.8562800Z c10_Bitset_test 2025-12-04T10:22:45.8562984Z c10_CompileTimeFunctionPointer_test 2025-12-04T10:22:45.8563204Z c10_ConstexprCrc_test 2025-12-04T10:22:45.8563378Z c10_DeadlockDetection_test 2025-12-04T10:22:45.8563566Z c10_DeviceGuard_test 2025-12-04T10:22:45.8563731Z c10_Device_test 2025-12-04T10:22:45.8563894Z c10_DispatchKeySet_test 2025-12-04T10:22:45.8564064Z c10_Enumerate_test 2025-12-04T10:22:45.8564220Z c10_Half_test 2025-12-04T10:22:45.8564400Z c10_InlineDeviceGuard_test 2025-12-04T10:22:45.8564590Z c10_InlineStreamGuard_test 2025-12-04T10:22:45.8564772Z c10_IntrusiveList_test 2025-12-04T10:22:45.8564942Z c10_LeftRight_test 2025-12-04T10:22:45.8565098Z c10_NetworkFlow_test 2025-12-04T10:22:45.8565361Z c10_Scalar_test 2025-12-04T10:22:45.8565513Z c10_Semaphore_test 2025-12-04T10:22:45.8565673Z c10_SizesAndStrides_test 2025-12-04T10:22:45.8565848Z 
c10_StreamGuard_test 2025-12-04T10:22:45.8566005Z c10_SymInt_test 2025-12-04T10:22:45.8566156Z c10_Synchronized_test 2025-12-04T10:22:45.8566320Z c10_ThreadLocal_test 2025-12-04T10:22:45.8566481Z c10_TypeIndex_test 2025-12-04T10:22:45.8566629Z c10_accumulate_test 2025-12-04T10:22:45.8566788Z c10_bfloat16_test 2025-12-04T10:22:45.8566945Z c10_bit_cast_test 2025-12-04T10:22:45.8567094Z c10_complex_math_test 2025-12-04T10:22:45.8567255Z c10_complex_test 2025-12-04T10:22:45.8567408Z c10_cow_test 2025-12-04T10:22:45.8567570Z c10_cuda_CUDAAssertionsTest_1_var_test 2025-12-04T10:22:45.8567811Z c10_cuda_CUDAAssertionsTest_catches_stream 2025-12-04T10:22:45.8568126Z c10_cuda_CUDAAssertionsTest_catches_thread_and_block_and_device 2025-12-04T10:22:45.8568434Z c10_cuda_CUDAAssertionsTest_from_2_processes 2025-12-04T10:22:45.8568738Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_blocks_and_threads 2025-12-04T10:22:45.8569118Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_multiple_blocks 2025-12-04T10:22:45.8569455Z c10_cuda_CUDAAssertionsTest_multiple_writes_from_same_block 2025-12-04T10:22:45.8569712Z c10_cuda_CUDATest 2025-12-04T10:22:45.8569876Z c10_error_test 2025-12-04T10:22:45.8570046Z c10_exception_test 2025-12-04T10:22:45.8570206Z c10_flags_test 2025-12-04T10:22:45.8570363Z c10_generic_math_test 2025-12-04T10:22:45.8570535Z c10_intrusive_ptr_benchmark 2025-12-04T10:22:45.8570728Z c10_intrusive_ptr_test 2025-12-04T10:22:45.8570898Z c10_irange_test 2025-12-04T10:22:45.8571047Z c10_lazy_test 2025-12-04T10:22:45.8571198Z c10_logging_test 2025-12-04T10:22:45.8571353Z c10_nofatal_test 2025-12-04T10:22:45.8571504Z c10_optional_test 2025-12-04T10:22:45.8571677Z c10_ordered_preserving_dict_test 2025-12-04T10:22:45.8571881Z c10_registry_test 2025-12-04T10:22:45.8572107Z c10_small_vector_test 2025-12-04T10:22:45.8572272Z c10_ssize_test 2025-12-04T10:22:45.8572434Z c10_string_util_test 2025-12-04T10:22:45.8572602Z c10_string_view_test 2025-12-04T10:22:45.8572756Z 
c10_tempfile_test 2025-12-04T10:22:45.8572906Z c10_typeid_test 2025-12-04T10:22:45.8573060Z cpu_allocator_test 2025-12-04T10:22:45.8573211Z cpu_generator_test 2025-12-04T10:22:45.8573375Z cpu_profiling_allocator_test 2025-12-04T10:22:45.8573559Z cpu_rng_test 2025-12-04T10:22:45.8573724Z cuda_allocatorTraceTracker_test 2025-12-04T10:22:45.8573920Z cuda_allocator_test 2025-12-04T10:22:45.8574082Z cuda_apply_test 2025-12-04T10:22:45.8574233Z cuda_atomic_ops_test 2025-12-04T10:22:45.8574412Z cuda_caching_host_allocator_test 2025-12-04T10:22:45.8574615Z cuda_complex_math_test 2025-12-04T10:22:45.8574778Z cuda_complex_test 2025-12-04T10:22:45.8574975Z cuda_cub_test 2025-12-04T10:22:45.8575134Z cuda_cublas_handle_pool_test 2025-12-04T10:22:45.8575310Z cuda_cudnn_test 2025-12-04T10:22:45.8575463Z cuda_device_test 2025-12-04T10:22:45.8575686Z cuda_distributions_test 2025-12-04T10:22:45.8575865Z cuda_dlconvertor_test 2025-12-04T10:22:45.8576033Z cuda_event_test 2025-12-04T10:22:45.8576194Z cuda_exchange_device_test 2025-12-04T10:22:45.8576373Z cuda_generator_test 2025-12-04T10:22:45.8576532Z cuda_half_test 2025-12-04T10:22:45.8576689Z cuda_integer_divider_test 2025-12-04T10:22:45.8576869Z cuda_optional_test 2025-12-04T10:22:45.8577049Z cuda_packedtensoraccessor_test 2025-12-04T10:22:45.8577255Z cuda_reportMemoryUsage_test 2025-12-04T10:22:45.8577453Z cuda_stream_test 2025-12-04T10:22:45.8577612Z cuda_vectorized_test 2025-12-04T10:22:45.8577781Z dlconvertor_test 2025-12-04T10:22:45.8577942Z example_allreduce 2025-12-04T10:22:45.8578113Z extension_backend_test 2025-12-04T10:22:45.8578282Z half_test 2025-12-04T10:22:45.8578526Z inline_container_test 2025-12-04T10:22:45.8578724Z ivalue_test 2025-12-04T10:22:45.8578889Z kernel_function_legacy_test 2025-12-04T10:22:45.8579088Z kernel_function_test 2025-12-04T10:22:45.8579259Z kernel_lambda_legacy_test 2025-12-04T10:22:45.8579510Z kernel_lambda_test 2025-12-04T10:22:45.8579677Z kernel_stackbased_test 2025-12-04T10:22:45.8579843Z 
lazy_tensor_test 2025-12-04T10:22:45.8580003Z legacy_vmap_test 2025-12-04T10:22:45.8580157Z libc10.so 2025-12-04T10:22:45.8580295Z libc10_cuda.so 2025-12-04T10:22:45.8580464Z libc10d_cuda_test.so 2025-12-04T10:22:45.8580630Z libcaffe2_nvrtc.so 2025-12-04T10:22:45.8580779Z 'libmkldnn*' 2025-12-04T10:22:45.8580923Z 'libnccl*' 2025-12-04T10:22:45.8581060Z libtorch.so 2025-12-04T10:22:45.8581203Z libtorch_cpu.so 2025-12-04T10:22:45.8581361Z libtorch_cuda.so 2025-12-04T10:22:45.8581522Z libtorch_cuda_linalg.so 2025-12-04T10:22:45.8581723Z libtorch_global_deps.so 2025-12-04T10:22:45.8581983Z libtorch_nvshmem.so 2025-12-04T10:22:45.8582154Z libtorch_python.so 2025-12-04T10:22:45.8582317Z libtorchbind_test.so 2025-12-04T10:22:45.8582492Z make_boxed_from_unboxed_functor_test 2025-12-04T10:22:45.8582699Z math_kernel_test 2025-12-04T10:22:45.8582858Z memory_format_test 2025-12-04T10:22:45.8583021Z memory_overlapping_test 2025-12-04T10:22:45.8583200Z mobile_memory_cleanup 2025-12-04T10:22:45.8583360Z native_test 2025-12-04T10:22:45.8583501Z op_allowlist_test 2025-12-04T10:22:45.8583672Z op_registration_test 2025-12-04T10:22:45.8583844Z operator_name_test 2025-12-04T10:22:45.8583993Z operators_test 2025-12-04T10:22:45.8584161Z packedtensoraccessor_test 2025-12-04T10:22:45.8584348Z parallel_benchmark 2025-12-04T10:22:45.8584499Z pow_test 2025-12-04T10:22:45.8584640Z protoc 2025-12-04T10:22:45.8584784Z protoc-3.13.0.0 2025-12-04T10:22:45.8584938Z quantized_test 2025-12-04T10:22:45.8585094Z reduce_ops_test 2025-12-04T10:22:45.8585260Z reportMemoryUsage_test 2025-12-04T10:22:45.8585428Z scalar_tensor_test 2025-12-04T10:22:45.8585601Z scalar_test 2025-12-04T10:22:45.8585755Z stride_properties_test 2025-12-04T10:22:45.8585919Z tensor_iterator_test 2025-12-04T10:22:45.8586083Z test_aoti_abi_check 2025-12-04T10:22:45.8586299Z test_api 2025-12-04T10:22:45.8586440Z test_cpp_rpc 2025-12-04T10:22:45.8586599Z test_dist_autograd 2025-12-04T10:22:45.8586765Z test_jit 
2025-12-04T10:22:45.8586896Z test_lazy 2025-12-04T10:22:45.8587038Z test_parallel 2025-12-04T10:22:45.8587190Z test_vec_half_AVX2 2025-12-04T10:22:45.8587470Z test_vec_half_AVX512 2025-12-04T10:22:45.8587638Z test_vec_half_DEFAULT 2025-12-04T10:22:45.8587801Z thread_init_test 2025-12-04T10:22:45.8587964Z torch_shm_manager 2025-12-04T10:22:45.8588114Z type_ptr_test 2025-12-04T10:22:45.8588262Z type_test 2025-12-04T10:22:45.8588410Z undefined_tensor_test 2025-12-04T10:22:45.8588575Z vec_test_all_types_AVX2 2025-12-04T10:22:45.8588757Z vec_test_all_types_AVX512 2025-12-04T10:22:45.8588940Z vec_test_all_types_DEFAULT 2025-12-04T10:22:45.8589119Z verify_api_visibility 2025-12-04T10:22:45.8589338Z weakref_test 2025-12-04T10:22:45.8589485Z wrapdim_test 2025-12-04T10:22:45.8589636Z xla_tensor_test 2025-12-04T10:22:45.8589806Z + aten/tools/run_tests.sh build/bin 2025-12-04T10:22:45.8590052Z + set -e 2025-12-04T10:22:45.8590204Z ++ dirname aten/tools/run_tests.sh 2025-12-04T10:22:45.8596005Z + VALGRIND_SUP=/var/lib/jenkins/workspace/aten/tools/valgrind.sup 2025-12-04T10:22:45.8596320Z + export CPP_TESTS_DIR=build/bin 2025-12-04T10:22:45.8596526Z + CPP_TESTS_DIR=build/bin 2025-12-04T10:22:45.8596703Z + VALGRIND=ON 2025-12-04T10:22:45.8597916Z + python test/run_test.py --cpp --verbose -i cpp/basic cpp/atest cpp/scalar_test cpp/broadcast_test cpp/wrapdim_test cpp/apply_utils_test cpp/dlconvertor_test cpp/native_test cpp/scalar_tensor_test cpp/undefined_tensor_test cpp/extension_backend_test cpp/lazy_tensor_test cpp/tensor_iterator_test cpp/Dimname_test cpp/Dict_test cpp/NamedTensor_test cpp/cpu_generator_test cpp/legacy_vmap_test cpp/operators_test 2025-12-04T10:22:50.2677503Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:22:50.2756441Z Found test times from artifacts 2025-12-04T10:22:50.3076399Z Found test times from artifacts 2025-12-04T10:22:50.3085723Z Running 
all tests 2025-12-04T10:22:50.3090026Z Running parallel tests on 3 processes 2025-12-04T10:22:50.3091146Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:22:50.3091509Z Serial tests (0): 2025-12-04T10:22:50.3091697Z Parallel tests (19): 2025-12-04T10:22:50.3091874Z cpp/Dict_test 1/1 2025-12-04T10:22:50.3092057Z cpp/Dimname_test 1/1 2025-12-04T10:22:50.3092257Z cpp/NamedTensor_test 1/1 2025-12-04T10:22:50.3092446Z cpp/apply_utils_test 1/1 2025-12-04T10:22:50.3092630Z cpp/atest 1/1 2025-12-04T10:22:50.3092789Z cpp/basic 1/1 2025-12-04T10:22:50.3092952Z cpp/broadcast_test 1/1 2025-12-04T10:22:50.3093142Z cpp/cpu_generator_test 1/1 2025-12-04T10:22:50.3093351Z cpp/dlconvertor_test 1/1 2025-12-04T10:22:50.3093557Z cpp/extension_backend_test 1/1 2025-12-04T10:22:50.3093759Z cpp/lazy_tensor_test 1/1 2025-12-04T10:22:50.3093944Z cpp/legacy_vmap_test 1/1 2025-12-04T10:22:50.3094132Z cpp/native_test 1/1 2025-12-04T10:22:50.3094309Z cpp/operators_test 1/1 2025-12-04T10:22:50.3094500Z cpp/scalar_tensor_test 1/1 2025-12-04T10:22:50.3094688Z cpp/scalar_test 1/1 2025-12-04T10:22:50.3094863Z cpp/tensor_iterator_test 1/1 2025-12-04T10:22:50.3095069Z cpp/undefined_tensor_test 1/1 2025-12-04T10:22:50.3095294Z cpp/wrapdim_test 1/1 2025-12-04T10:22:50.3095482Z Name: excluded (est. time: 0.0min) 2025-12-04T10:22:50.3095692Z Serial tests (0): 2025-12-04T10:22:50.3095954Z Parallel tests (0): 2025-12-04T10:22:50.3097169Z Running cpp/Dict_test 1/1 ... [2025-12-04 10:22:50.309528][2646.944006855] 2025-12-04T10:22:50.3097617Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:50.3098219Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:50.3098606Z Finished cpp/Dict_test 1/1 ... [2025-12-04 10:22:50.309684][2646.944163828], took 0.00min 2025-12-04T10:22:51.5913008Z Uploading artifacts took 1.27 seconds 2025-12-04T10:22:51.5913809Z Running cpp/Dimname_test 1/1 ... 
[2025-12-04 10:22:51.590900][2648.225372165] 2025-12-04T10:22:51.5914247Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.5914619Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.5915125Z Finished cpp/Dimname_test 1/1 ... [2025-12-04 10:22:51.591221][2648.225700972], took 0.00min 2025-12-04T10:22:51.6026375Z Running cpp/NamedTensor_test 1/1 ... [2025-12-04 10:22:51.602410][2648.236890151] 2025-12-04T10:22:51.6026880Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6027440Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6028013Z Finished cpp/NamedTensor_test 1/1 ... [2025-12-04 10:22:51.602552][2648.237032354], took 0.00min 2025-12-04T10:22:51.6135537Z Running cpp/apply_utils_test 1/1 ... [2025-12-04 10:22:51.613314][2648.247793785] 2025-12-04T10:22:51.6135975Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6136330Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6136954Z Finished cpp/apply_utils_test 1/1 ... [2025-12-04 10:22:51.613452][2648.247932408], took 0.00min 2025-12-04T10:22:51.6243511Z Running cpp/atest 1/1 ... [2025-12-04 10:22:51.624130][2648.258609966] 2025-12-04T10:22:51.6243919Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6244277Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6244746Z Finished cpp/atest 1/1 ... [2025-12-04 10:22:51.624270][2648.258750249], took 0.00min 2025-12-04T10:22:51.6351628Z Running cpp/basic 1/1 ... [2025-12-04 10:22:51.634956][2648.269436048] 2025-12-04T10:22:51.6352006Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6352566Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6353280Z Finished cpp/basic 1/1 ... [2025-12-04 10:22:51.635093][2648.26957361], took 0.00min 2025-12-04T10:22:51.6461106Z Running cpp/broadcast_test 1/1 ... 
[2025-12-04 10:22:51.645906][2648.280386113] 2025-12-04T10:22:51.6461666Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6462371Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6462913Z Finished cpp/broadcast_test 1/1 ... [2025-12-04 10:22:51.646044][2648.280524456], took 0.00min 2025-12-04T10:22:51.6570124Z Running cpp/cpu_generator_test 1/1 ... [2025-12-04 10:22:51.656783][2648.291262575] 2025-12-04T10:22:51.6570773Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6571199Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6571817Z Finished cpp/cpu_generator_test 1/1 ... [2025-12-04 10:22:51.656916][2648.291396638], took 0.00min 2025-12-04T10:22:51.6678346Z Running cpp/dlconvertor_test 1/1 ... [2025-12-04 10:22:51.667629][2648.302108796] 2025-12-04T10:22:51.6678929Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6679303Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6679960Z Finished cpp/dlconvertor_test 1/1 ... [2025-12-04 10:22:51.667764][2648.30224437], took 0.00min 2025-12-04T10:22:51.6785517Z Running cpp/extension_backend_test 1/1 ... [2025-12-04 10:22:51.678339][2648.312819276] 2025-12-04T10:22:51.6786122Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6786476Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6787133Z Finished cpp/extension_backend_test 1/1 ... [2025-12-04 10:22:51.678473][2648.312953099], took 0.00min 2025-12-04T10:22:51.6893588Z Running cpp/lazy_tensor_test 1/1 ... [2025-12-04 10:22:51.689136][2648.323616007] 2025-12-04T10:22:51.6894155Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.6894510Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.6895015Z Finished cpp/lazy_tensor_test 1/1 ... [2025-12-04 10:22:51.689273][2648.32375325], took 0.00min 2025-12-04T10:22:51.7001589Z Running cpp/legacy_vmap_test 1/1 ... 
[2025-12-04 10:22:51.699942][2648.334421559] 2025-12-04T10:22:51.7002142Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7002634Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7003145Z Finished cpp/legacy_vmap_test 1/1 ... [2025-12-04 10:22:51.700079][2648.334559492], took 0.00min 2025-12-04T10:22:51.7111106Z Running cpp/native_test 1/1 ... [2025-12-04 10:22:51.710890][2648.345369642] 2025-12-04T10:22:51.7111542Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7111920Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7112404Z Finished cpp/native_test 1/1 ... [2025-12-04 10:22:51.711029][2648.345509516], took 0.00min 2025-12-04T10:22:51.7218818Z Running cpp/operators_test 1/1 ... [2025-12-04 10:22:51.721646][2648.356126143] 2025-12-04T10:22:51.7219392Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7219741Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7220549Z Finished cpp/operators_test 1/1 ... [2025-12-04 10:22:51.721787][2648.356267796], took 0.00min 2025-12-04T10:22:51.7327473Z Running cpp/scalar_tensor_test 1/1 ... [2025-12-04 10:22:51.732536][2648.367016266] 2025-12-04T10:22:51.7328062Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7328416Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7328918Z Finished cpp/scalar_tensor_test 1/1 ... [2025-12-04 10:22:51.732673][2648.367152669], took 0.00min 2025-12-04T10:22:51.7434062Z Running cpp/scalar_test 1/1 ... [2025-12-04 10:22:51.743193][2648.377673385] 2025-12-04T10:22:51.7434487Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7434995Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7435516Z Finished cpp/scalar_test 1/1 ... [2025-12-04 10:22:51.743339][2648.377819478], took 0.00min 2025-12-04T10:22:51.7541343Z Running cpp/tensor_iterator_test 1/1 ... 
[2025-12-04 10:22:51.753931][2648.388410764] 2025-12-04T10:22:51.7541772Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7542110Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7542642Z Finished cpp/tensor_iterator_test 1/1 ... [2025-12-04 10:22:51.754065][2648.388545737], took 0.00min 2025-12-04T10:22:51.7647736Z Running cpp/undefined_tensor_test 1/1 ... [2025-12-04 10:22:51.764564][2648.399044201] 2025-12-04T10:22:51.7648328Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7648679Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7649353Z Finished cpp/undefined_tensor_test 1/1 ... [2025-12-04 10:22:51.764700][2648.399180474], took 0.00min 2025-12-04T10:22:51.7754917Z Running cpp/wrapdim_test 1/1 ... [2025-12-04 10:22:51.775289][2648.409768921] 2025-12-04T10:22:51.7755512Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:51.7755995Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:51.7756515Z Finished cpp/wrapdim_test 1/1 ... [2025-12-04 10:22:51.775429][2648.409908774], took 0.00min 2025-12-04T10:22:54.7573475Z Running cpp/Dict_test 1/1 ... [2025-12-04 10:22:54.756853][2651.391328174] 2025-12-04T10:22:54.7573959Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:54.7574337Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:54.7574821Z Finished cpp/Dict_test 1/1 ... [2025-12-04 10:22:54.757023][2651.391500607], took 0.00min 2025-12-04T10:22:54.7806389Z Running cpp/Dimname_test 1/1 ... [2025-12-04 10:22:54.780299][2651.414773226] 2025-12-04T10:22:54.7806825Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:54.7807193Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:54.7807691Z Finished cpp/Dimname_test 1/1 ... [2025-12-04 10:22:54.780463][2651.41494234], took 0.00min 2025-12-04T10:22:54.7870055Z Running cpp/NamedTensor_test 1/1 ... 
[2025-12-04 10:22:54.786729][2651.421197164] 2025-12-04T10:22:54.7870484Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:54.7871224Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:54.7871982Z Finished cpp/NamedTensor_test 1/1 ... [2025-12-04 10:22:54.786926][2651.421404019], took 0.00min 2025-12-04T10:22:58.4521064Z Running cpp/apply_utils_test 1/1 ... [2025-12-04 10:22:58.451622][2655.086096853] 2025-12-04T10:22:58.4521624Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:58.4521973Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:58.4522491Z Finished cpp/apply_utils_test 1/1 ... [2025-12-04 10:22:58.451791][2655.086269647], took 0.00min 2025-12-04T10:22:58.6572525Z Running cpp/atest 1/1 ... [2025-12-04 10:22:58.656792][2655.291266545] 2025-12-04T10:22:58.6573269Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:58.6573894Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:58.6574461Z Finished cpp/atest 1/1 ... [2025-12-04 10:22:58.657001][2655.2914795], took 0.00min 2025-12-04T10:22:58.6882829Z Running cpp/basic 1/1 ... [2025-12-04 10:22:58.687847][2655.32232185] 2025-12-04T10:22:58.6883429Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:22:58.6884038Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:22:58.6884531Z Finished cpp/basic 1/1 ... [2025-12-04 10:22:58.688031][2655.322509944], took 0.00min 2025-12-04T10:23:02.1657983Z Running cpp/broadcast_test 1/1 ... [2025-12-04 10:23:02.165340][2658.799815209] 2025-12-04T10:23:02.1658463Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:02.1658832Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:02.1659356Z Finished cpp/broadcast_test 1/1 ... [2025-12-04 10:23:02.165558][2658.800036234], took 0.00min 2025-12-04T10:23:02.4805969Z Running cpp/cpu_generator_test 1/1 ... 
[2025-12-04 10:23:02.480165][2659.114640248] 2025-12-04T10:23:02.4806727Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:02.4807273Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:02.4808181Z Finished cpp/cpu_generator_test 1/1 ... [2025-12-04 10:23:02.480345][2659.114824862], took 0.00min 2025-12-04T10:23:02.4855719Z Running cpp/dlconvertor_test 1/1 ... [2025-12-04 10:23:02.485279][2659.119753948] 2025-12-04T10:23:02.4856348Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:02.4857184Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:02.4857913Z Finished cpp/dlconvertor_test 1/1 ... [2025-12-04 10:23:02.485444][2659.119922762], took 0.00min 2025-12-04T10:23:05.8972798Z Running cpp/extension_backend_test 1/1 ... [2025-12-04 10:23:05.896791][2662.531265374] 2025-12-04T10:23:05.8973357Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:05.8973722Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:05.8974266Z Finished cpp/extension_backend_test 1/1 ... [2025-12-04 10:23:05.896962][2662.531441777], took 0.00min 2025-12-04T10:23:06.2828161Z Running cpp/lazy_tensor_test 1/1 ... [2025-12-04 10:23:06.282338][2662.916812987] 2025-12-04T10:23:06.2828676Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:06.2829038Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:06.2829558Z Finished cpp/lazy_tensor_test 1/1 ... [2025-12-04 10:23:06.282514][2662.916993621], took 0.00min 2025-12-04T10:23:06.2902766Z Running cpp/legacy_vmap_test 1/1 ... [2025-12-04 10:23:06.289966][2662.92444068] 2025-12-04T10:23:06.2903461Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:06.2904057Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:06.2904713Z Finished cpp/legacy_vmap_test 1/1 ... [2025-12-04 10:23:06.290140][2662.924619304], took 0.00min 2025-12-04T10:23:09.5532111Z Running cpp/native_test 1/1 ... 
[2025-12-04 10:23:09.552749][2666.187224513] 2025-12-04T10:23:09.5532707Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:09.5533093Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:09.5533740Z Finished cpp/native_test 1/1 ... [2025-12-04 10:23:09.552924][2666.187404066], took 0.00min 2025-12-04T10:23:10.0726896Z Running cpp/operators_test 1/1 ... [2025-12-04 10:23:10.072270][2666.706744694] 2025-12-04T10:23:10.0727757Z Running cpp/scalar_tensor_test 1/1 ... [2025-12-04 10:23:10.072298][2666.706772484] 2025-12-04T10:23:10.0728191Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:10.0728474Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:10.0728818Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:10.0729332Z Finished cpp/operators_test 1/1 ... [2025-12-04 10:23:10.072451][2666.706930748], took 0.00min 2025-12-04T10:23:10.0729832Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:10.0730334Z Finished cpp/scalar_tensor_test 1/1 ... [2025-12-04 10:23:10.072477][2666.706956618], took 0.00min 2025-12-04T10:23:13.2281021Z Running cpp/scalar_test 1/1 ... [2025-12-04 10:23:13.227638][2669.862112527] 2025-12-04T10:23:13.2281481Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:13.2282164Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:13.2282694Z Finished cpp/scalar_test 1/1 ... [2025-12-04 10:23:13.227801][2669.86228038], took 0.00min 2025-12-04T10:23:13.7568266Z Running cpp/tensor_iterator_test 1/1 ... [2025-12-04 10:23:13.756353][2670.390827155] 2025-12-04T10:23:13.7568999Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:13.7569283Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:13.7569722Z Finished cpp/tensor_iterator_test 1/1 ... [2025-12-04 10:23:13.756525][2670.391004359], took 0.00min 2025-12-04T10:23:13.8663894Z Running cpp/undefined_tensor_test 1/1 ... 
[2025-12-04 10:23:13.865941][2670.500415231] 2025-12-04T10:23:13.8664458Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:13.8664874Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:13.8665483Z Finished cpp/undefined_tensor_test 1/1 ... [2025-12-04 10:23:13.866124][2670.500602665], took 0.00min 2025-12-04T10:23:16.8862370Z Running cpp/wrapdim_test 1/1 ... [2025-12-04 10:23:16.885759][2673.520234078] 2025-12-04T10:23:16.8863043Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:16.8863542Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:16.8864261Z Finished cpp/wrapdim_test 1/1 ... [2025-12-04 10:23:16.885951][2673.520429923], took 0.00min 2025-12-04T10:23:18.3292226Z Running test batch 'tests to run' cost 28.02 seconds 2025-12-04T10:23:18.9249150Z + run_if_exists tensor_interop_test 2025-12-04T10:23:18.9249504Z + local test_name=tensor_interop_test 2025-12-04T10:23:18.9249771Z + [[ -x build/bin/tensor_interop_test ]] 2025-12-04T10:23:18.9250033Z + echo 'Warning: tensor_interop_test does not exist.' 2025-12-04T10:23:18.9250324Z Warning: tensor_interop_test does not exist. 2025-12-04T10:23:18.9250566Z + run_if_exists cudnn_test 2025-12-04T10:23:18.9250765Z + local test_name=cudnn_test 2025-12-04T10:23:18.9250962Z + [[ -x build/bin/cudnn_test ]] 2025-12-04T10:23:18.9251184Z + echo 'Warning: cudnn_test does not exist.' 2025-12-04T10:23:18.9251448Z Warning: cudnn_test does not exist. 
2025-12-04T10:23:18.9251671Z + run_if_exists cuda_generator_test 2025-12-04T10:23:18.9251887Z + local test_name=cuda_generator_test 2025-12-04T10:23:18.9252119Z + [[ -x build/bin/cuda_generator_test ]] 2025-12-04T10:23:18.9252432Z + python test/run_test.py --cpp --verbose -i cpp/cuda_generator_test 2025-12-04T10:23:23.3104264Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:23:23.3183139Z Found test times from artifacts 2025-12-04T10:23:23.3504333Z Found test times from artifacts 2025-12-04T10:23:23.3513892Z Running all tests 2025-12-04T10:23:23.3516850Z Running parallel tests on 3 processes 2025-12-04T10:23:23.3517234Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:23:23.3517582Z Serial tests (0): 2025-12-04T10:23:23.3517811Z Parallel tests (1): 2025-12-04T10:23:23.3518046Z cpp/cuda_generator_test 1/1 2025-12-04T10:23:23.3518351Z Name: excluded (est. time: 0.0min) 2025-12-04T10:23:23.3518613Z Serial tests (0): 2025-12-04T10:23:23.3518880Z Parallel tests (0): 2025-12-04T10:23:23.3519566Z Running cpp/cuda_generator_test 1/1 ... [2025-12-04 10:23:23.351732][2679.986211261] 2025-12-04T10:23:23.3520295Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:23.3520663Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:23.3521311Z Finished cpp/cuda_generator_test 1/1 ... [2025-12-04 10:23:23.351891][2679.986371085], took 0.00min 2025-12-04T10:23:24.6100394Z Uploading artifacts took 1.25 seconds 2025-12-04T10:23:27.5873808Z Running cpp/cuda_generator_test 1/1 ... [2025-12-04 10:23:27.586897][2684.221372459] 2025-12-04T10:23:27.5874343Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:27.5874695Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:27.5875235Z Finished cpp/cuda_generator_test 1/1 ... 
[2025-12-04 10:23:27.587063][2684.221542043], took 0.00min 2025-12-04T10:23:28.4675165Z Running test batch 'tests to run' cost 5.12 seconds 2025-12-04T10:23:29.0567439Z + run_if_exists apply_test 2025-12-04T10:23:29.0567792Z + local test_name=apply_test 2025-12-04T10:23:29.0568119Z + [[ -x build/bin/apply_test ]] 2025-12-04T10:23:29.0568722Z + echo 'Warning: apply_test does not exist.' 2025-12-04T10:23:29.0569051Z Warning: apply_test does not exist. 2025-12-04T10:23:29.0569342Z + run_if_exists stream_test 2025-12-04T10:23:29.0569581Z + local test_name=stream_test 2025-12-04T10:23:29.0569849Z + [[ -x build/bin/stream_test ]] 2025-12-04T10:23:29.0570123Z + echo 'Warning: stream_test does not exist.' 2025-12-04T10:23:29.0570408Z Warning: stream_test does not exist. 2025-12-04T10:23:29.0570677Z + run_if_exists cuda_half_test 2025-12-04T10:23:29.0570920Z + local test_name=cuda_half_test 2025-12-04T10:23:29.0571186Z + [[ -x build/bin/cuda_half_test ]] 2025-12-04T10:23:29.0571533Z + python test/run_test.py --cpp --verbose -i cpp/cuda_half_test 2025-12-04T10:23:33.4627984Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:23:33.4706878Z Found test times from artifacts 2025-12-04T10:23:33.5032793Z Found test times from artifacts 2025-12-04T10:23:33.5041982Z Running all tests 2025-12-04T10:23:33.5044978Z Running parallel tests on 3 processes 2025-12-04T10:23:33.5045329Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:23:33.5045734Z Serial tests (0): 2025-12-04T10:23:33.5045969Z Parallel tests (1): 2025-12-04T10:23:33.5046196Z cpp/cuda_half_test 1/1 2025-12-04T10:23:33.5046523Z Name: excluded (est. time: 0.0min) 2025-12-04T10:23:33.5046966Z Serial tests (0): 2025-12-04T10:23:33.5047177Z Parallel tests (0): 2025-12-04T10:23:33.5047511Z Running cpp/cuda_half_test 1/1 ... 
[2025-12-04 10:23:33.504482][2690.138961103] 2025-12-04T10:23:33.5047919Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:33.5048274Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:33.5048793Z Finished cpp/cuda_half_test 1/1 ... [2025-12-04 10:23:33.504611][2690.139091866], took 0.00min 2025-12-04T10:23:34.9359943Z Uploading artifacts took 1.42 seconds 2025-12-04T10:23:37.9900047Z Running cpp/cuda_half_test 1/1 ... [2025-12-04 10:23:37.989525][2694.624000169] 2025-12-04T10:23:37.9900540Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:37.9900896Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:37.9901399Z Finished cpp/cuda_half_test 1/1 ... [2025-12-04 10:23:37.989694][2694.624174712], took 0.00min 2025-12-04T10:23:38.8208774Z Running test batch 'tests to run' cost 5.32 seconds 2025-12-04T10:23:39.4053446Z + run_if_exists cuda_vectorized_test 2025-12-04T10:23:39.4053845Z + local test_name=cuda_vectorized_test 2025-12-04T10:23:39.4054150Z + [[ -x build/bin/cuda_vectorized_test ]] 2025-12-04T10:23:39.4054558Z + python test/run_test.py --cpp --verbose -i cpp/cuda_vectorized_test 2025-12-04T10:23:43.7304936Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:23:43.7383964Z Found test times from artifacts 2025-12-04T10:23:43.7704549Z Found test times from artifacts 2025-12-04T10:23:43.7713938Z Running all tests 2025-12-04T10:23:43.7716714Z Running parallel tests on 3 processes 2025-12-04T10:23:43.7716986Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:23:43.7717257Z Serial tests (0): 2025-12-04T10:23:43.7717466Z Parallel tests (1): 2025-12-04T10:23:43.7717700Z cpp/cuda_vectorized_test 1/1 2025-12-04T10:23:43.7717960Z Name: excluded (est. 
time: 0.0min) 2025-12-04T10:23:43.7718199Z Serial tests (0): 2025-12-04T10:23:43.7718402Z Parallel tests (0): 2025-12-04T10:23:43.7718975Z Running cpp/cuda_vectorized_test 1/1 ... [2025-12-04 10:23:43.771701][2700.406179282] 2025-12-04T10:23:43.7719387Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:43.7719740Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:43.7720424Z Finished cpp/cuda_vectorized_test 1/1 ... [2025-12-04 10:23:43.771831][2700.406311365], took 0.00min 2025-12-04T10:23:45.0840337Z Uploading artifacts took 1.30 seconds 2025-12-04T10:23:48.0556652Z Running cpp/cuda_vectorized_test 1/1 ... [2025-12-04 10:23:48.055145][2704.689619917] 2025-12-04T10:23:48.0557480Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:48.0557849Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:48.0558388Z Finished cpp/cuda_vectorized_test 1/1 ... [2025-12-04 10:23:48.055322][2704.689801591], took 0.00min 2025-12-04T10:23:48.9291083Z Running test batch 'tests to run' cost 5.16 seconds 2025-12-04T10:23:49.5119226Z + run_if_exists cuda_distributions_test 2025-12-04T10:23:49.5119606Z + local test_name=cuda_distributions_test 2025-12-04T10:23:49.5119940Z + [[ -x build/bin/cuda_distributions_test ]] 2025-12-04T10:23:49.5120364Z + python test/run_test.py --cpp --verbose -i cpp/cuda_distributions_test 2025-12-04T10:23:53.9052010Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:23:53.9131186Z Found test times from artifacts 2025-12-04T10:23:53.9451335Z Found test times from artifacts 2025-12-04T10:23:53.9460890Z Running all tests 2025-12-04T10:23:53.9463908Z Running parallel tests on 3 processes 2025-12-04T10:23:53.9464224Z Name: tests to run (est. 
time: 0.0min) 2025-12-04T10:23:53.9464500Z Serial tests (0): 2025-12-04T10:23:53.9464850Z Parallel tests (1): 2025-12-04T10:23:53.9465106Z cpp/cuda_distributions_test 1/1 2025-12-04T10:23:53.9465388Z Name: excluded (est. time: 0.0min) 2025-12-04T10:23:53.9465633Z Serial tests (0): 2025-12-04T10:23:53.9465916Z Parallel tests (0): 2025-12-04T10:23:53.9466742Z Running cpp/cuda_distributions_test 1/1 ... [2025-12-04 10:23:53.946420][2710.580899777] 2025-12-04T10:23:53.9467374Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:53.9467914Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:53.9468487Z Finished cpp/cuda_distributions_test 1/1 ... [2025-12-04 10:23:53.946573][2710.5810531], took 0.00min 2025-12-04T10:23:55.2634601Z Uploading artifacts took 1.31 seconds 2025-12-04T10:23:58.3369592Z Running cpp/cuda_distributions_test 1/1 ... [2025-12-04 10:23:58.336503][2714.970977202] 2025-12-04T10:23:58.3370204Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:23:58.3370572Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:23:58.3371118Z Finished cpp/cuda_distributions_test 1/1 ... 
[2025-12-04 10:23:58.336675][2714.971155016], took 0.00min 2025-12-04T10:23:59.2127103Z Running test batch 'tests to run' cost 5.27 seconds 2025-12-04T10:23:59.8088821Z + run_if_exists cuda_optional_test 2025-12-04T10:23:59.8089334Z + local test_name=cuda_optional_test 2025-12-04T10:23:59.8089805Z + [[ -x build/bin/cuda_optional_test ]] 2025-12-04T10:23:59.8090433Z + python test/run_test.py --cpp --verbose -i cpp/cuda_optional_test 2025-12-04T10:24:04.2313395Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:04.2392109Z Found test times from artifacts 2025-12-04T10:24:04.2713990Z Found test times from artifacts 2025-12-04T10:24:04.2722983Z Running all tests 2025-12-04T10:24:04.2726068Z Running parallel tests on 3 processes 2025-12-04T10:24:04.2726344Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:24:04.2726601Z Serial tests (0): 2025-12-04T10:24:04.2726856Z Parallel tests (1): 2025-12-04T10:24:04.2727112Z cpp/cuda_optional_test 1/1 2025-12-04T10:24:04.2727391Z Name: excluded (est. time: 0.0min) 2025-12-04T10:24:04.2727661Z Serial tests (0): 2025-12-04T10:24:04.2727893Z Parallel tests (0): 2025-12-04T10:24:04.2728288Z Running cpp/cuda_optional_test 1/1 ... [2025-12-04 10:24:04.272618][2720.907098212] 2025-12-04T10:24:04.2728627Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:04.2728896Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:04.2729465Z Finished cpp/cuda_optional_test 1/1 ... [2025-12-04 10:24:04.272745][2720.907226455], took 0.00min 2025-12-04T10:24:05.5769212Z Uploading artifacts took 1.29 seconds 2025-12-04T10:24:08.5820139Z Running cpp/cuda_optional_test 1/1 ... 
[2025-12-04 10:24:08.581532][2725.216006065] 2025-12-04T10:24:08.5821007Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:08.5821361Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:08.5821881Z Finished cpp/cuda_optional_test 1/1 ... [2025-12-04 10:24:08.581704][2725.216183239], took 0.00min 2025-12-04T10:24:09.4699717Z Running test batch 'tests to run' cost 5.2 seconds 2025-12-04T10:24:10.0489439Z + run_if_exists cuda_tensor_interop_test 2025-12-04T10:24:10.0489833Z + local test_name=cuda_tensor_interop_test 2025-12-04T10:24:10.0490184Z + [[ -x build/bin/cuda_tensor_interop_test ]] 2025-12-04T10:24:10.0490572Z + echo 'Warning: cuda_tensor_interop_test does not exist.' 2025-12-04T10:24:10.0490972Z Warning: cuda_tensor_interop_test does not exist. 2025-12-04T10:24:10.0491309Z + run_if_exists cuda_complex_test 2025-12-04T10:24:10.0491579Z + local test_name=cuda_complex_test 2025-12-04T10:24:10.0491867Z + [[ -x build/bin/cuda_complex_test ]] 2025-12-04T10:24:10.0492547Z + python test/run_test.py --cpp --verbose -i cpp/cuda_complex_test 2025-12-04T10:24:14.4379480Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:14.4459449Z Found test times from artifacts 2025-12-04T10:24:14.4787763Z Found test times from artifacts 2025-12-04T10:24:14.4797059Z Running all tests 2025-12-04T10:24:14.4799884Z Running parallel tests on 3 processes 2025-12-04T10:24:14.4800238Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:24:14.4800621Z Serial tests (0): 2025-12-04T10:24:14.4800858Z Parallel tests (1): 2025-12-04T10:24:14.4801084Z cpp/cuda_complex_test 1/1 2025-12-04T10:24:14.4801489Z Name: excluded (est. time: 0.0min) 2025-12-04T10:24:14.4801976Z Serial tests (0): 2025-12-04T10:24:14.4802339Z Parallel tests (0): 2025-12-04T10:24:14.4802789Z Running cpp/cuda_complex_test 1/1 ... 
[2025-12-04 10:24:14.480023][2731.114502177] 2025-12-04T10:24:14.4803375Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:14.4803737Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:14.4804239Z Finished cpp/cuda_complex_test 1/1 ... [2025-12-04 10:24:14.480167][2731.11464707], took 0.00min 2025-12-04T10:24:15.7654695Z Uploading artifacts took 1.27 seconds 2025-12-04T10:24:18.7930328Z Running cpp/cuda_complex_test 1/1 ... [2025-12-04 10:24:18.792574][2735.427049131] 2025-12-04T10:24:18.7930844Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:18.7931360Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:18.7931881Z Finished cpp/cuda_complex_test 1/1 ... [2025-12-04 10:24:18.792740][2735.427219574], took 0.00min 2025-12-04T10:24:19.6326969Z Running test batch 'tests to run' cost 5.15 seconds 2025-12-04T10:24:20.2136623Z + run_if_exists cuda_complex_math_test 2025-12-04T10:24:20.2136996Z + local test_name=cuda_complex_math_test 2025-12-04T10:24:20.2137621Z + [[ -x build/bin/cuda_complex_math_test ]] 2025-12-04T10:24:20.2138094Z + python test/run_test.py --cpp --verbose -i cpp/cuda_complex_math_test 2025-12-04T10:24:24.6110308Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:24.6189795Z Found test times from artifacts 2025-12-04T10:24:24.6510465Z Found test times from artifacts 2025-12-04T10:24:24.6519652Z Running all tests 2025-12-04T10:24:24.6522574Z Running parallel tests on 3 processes 2025-12-04T10:24:24.6522846Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:24:24.6523064Z Serial tests (0): 2025-12-04T10:24:24.6523243Z Parallel tests (1): 2025-12-04T10:24:24.6523430Z cpp/cuda_complex_math_test 1/1 2025-12-04T10:24:24.6530957Z Name: excluded (est. 
time: 0.0min) 2025-12-04T10:24:24.6531248Z Serial tests (0): 2025-12-04T10:24:24.6531430Z Parallel tests (0): 2025-12-04T10:24:24.6531748Z Running cpp/cuda_complex_math_test 1/1 ... [2025-12-04 10:24:24.652283][2741.286761198] 2025-12-04T10:24:24.6532214Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:24.6532489Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:24.6532900Z Finished cpp/cuda_complex_math_test 1/1 ... [2025-12-04 10:24:24.652432][2741.286911381], took 0.00min 2025-12-04T10:24:25.9878851Z Uploading artifacts took 1.32 seconds 2025-12-04T10:24:29.0054189Z Running cpp/cuda_complex_math_test 1/1 ... [2025-12-04 10:24:29.004929][2745.639404701] 2025-12-04T10:24:29.0054692Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:29.0055050Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:29.0055867Z Finished cpp/cuda_complex_math_test 1/1 ... [2025-12-04 10:24:29.005128][2745.639607035], took 0.00min 2025-12-04T10:24:29.8552225Z Running test batch 'tests to run' cost 5.2 seconds 2025-12-04T10:24:30.4395600Z + run_if_exists cuda_cub_test 2025-12-04T10:24:30.4395893Z + local test_name=cuda_cub_test 2025-12-04T10:24:30.4396150Z + [[ -x build/bin/cuda_cub_test ]] 2025-12-04T10:24:30.4396758Z + python test/run_test.py --cpp --verbose -i cpp/cuda_cub_test 2025-12-04T10:24:34.8069941Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:34.8148816Z Found test times from artifacts 2025-12-04T10:24:34.8468373Z Found test times from artifacts 2025-12-04T10:24:34.8476980Z Running all tests 2025-12-04T10:24:34.8480023Z Running parallel tests on 3 processes 2025-12-04T10:24:34.8480505Z Name: tests to run (est. 
time: 0.0min) 2025-12-04T10:24:34.8480786Z Serial tests (0): 2025-12-04T10:24:34.8481004Z Parallel tests (1): 2025-12-04T10:24:34.8481248Z cpp/cuda_cub_test 1/1 2025-12-04T10:24:34.8481685Z Name: excluded (est. time: 0.0min) 2025-12-04T10:24:34.8482008Z Serial tests (0): 2025-12-04T10:24:34.8482360Z Parallel tests (0): 2025-12-04T10:24:34.8482727Z Running cpp/cuda_cub_test 1/1 ... [2025-12-04 10:24:34.848012][2751.482491872] 2025-12-04T10:24:34.8483272Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:34.8483590Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:34.8483986Z Finished cpp/cuda_cub_test 1/1 ... [2025-12-04 10:24:34.848143][2751.482623575], took 0.00min 2025-12-04T10:24:36.3206789Z Uploading artifacts took 1.46 seconds 2025-12-04T10:24:39.4097463Z Running cpp/cuda_cub_test 1/1 ... [2025-12-04 10:24:39.409282][2756.043756652] 2025-12-04T10:24:39.4097993Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:39.4098358Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:39.4098869Z Finished cpp/cuda_cub_test 1/1 ... 
[2025-12-04 10:24:39.409467][2756.043945166], took 0.00min 2025-12-04T10:24:40.2558148Z Running test batch 'tests to run' cost 5.41 seconds 2025-12-04T10:24:40.8438499Z + run_if_exists cuda_atomic_ops_test 2025-12-04T10:24:40.8438872Z + local test_name=cuda_atomic_ops_test 2025-12-04T10:24:40.8439177Z + [[ -x build/bin/cuda_atomic_ops_test ]] 2025-12-04T10:24:40.8439867Z + python test/run_test.py --cpp --verbose -i cpp/cuda_atomic_ops_test 2025-12-04T10:24:45.2648375Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:45.2727693Z Found test times from artifacts 2025-12-04T10:24:45.3047364Z Found test times from artifacts 2025-12-04T10:24:45.3056664Z Running all tests 2025-12-04T10:24:45.3060107Z Running parallel tests on 3 processes 2025-12-04T10:24:45.3060436Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:24:45.3060722Z Serial tests (0): 2025-12-04T10:24:45.3060935Z Parallel tests (1): 2025-12-04T10:24:45.3061187Z cpp/cuda_atomic_ops_test 1/1 2025-12-04T10:24:45.3061592Z Name: excluded (est. time: 0.0min) 2025-12-04T10:24:45.3062090Z Serial tests (0): 2025-12-04T10:24:45.3062464Z Parallel tests (0): 2025-12-04T10:24:45.3062960Z Running cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 10:24:45.306035][2761.940513864] 2025-12-04T10:24:45.3063384Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:45.3063890Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:45.3064413Z Finished cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 10:24:45.306189][2761.940669097], took 0.00min 2025-12-04T10:24:46.6801175Z Uploading artifacts took 1.36 seconds 2025-12-04T10:24:49.6336388Z Running cpp/cuda_atomic_ops_test 1/1 ... 
[2025-12-04 10:24:49.633135][2766.267608905] 2025-12-04T10:24:49.6336877Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:49.6337235Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:49.6337776Z Finished cpp/cuda_atomic_ops_test 1/1 ... [2025-12-04 10:24:49.633345][2766.267823439], took 0.00min 2025-12-04T10:24:50.5533983Z Running test batch 'tests to run' cost 5.25 seconds 2025-12-04T10:24:51.1311268Z + run_if_exists cuda_allocator_test 2025-12-04T10:24:51.1311589Z + local test_name=cuda_allocator_test 2025-12-04T10:24:51.1311846Z + [[ -x build/bin/cuda_allocator_test ]] 2025-12-04T10:24:51.1312177Z + python test/run_test.py --cpp --verbose -i cpp/cuda_allocator_test 2025-12-04T10:24:55.5255443Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:24:55.5334138Z Found test times from artifacts 2025-12-04T10:24:55.5657155Z Found test times from artifacts 2025-12-04T10:24:55.5665780Z Running all tests 2025-12-04T10:24:55.5669145Z Running parallel tests on 3 processes 2025-12-04T10:24:55.5669502Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:24:55.5669983Z Serial tests (0): 2025-12-04T10:24:55.5670258Z Parallel tests (1): 2025-12-04T10:24:55.5670542Z cpp/cuda_allocator_test 1/1 2025-12-04T10:24:55.5671228Z Name: excluded (est. time: 0.0min) 2025-12-04T10:24:55.5671521Z Serial tests (0): 2025-12-04T10:24:55.5671746Z Parallel tests (0): 2025-12-04T10:24:55.5672255Z Running cpp/cuda_allocator_test 1/1 ... [2025-12-04 10:24:55.566927][2772.201405531] 2025-12-04T10:24:55.5672701Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:24:55.5673061Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:24:55.5673584Z Finished cpp/cuda_allocator_test 1/1 ... 
[2025-12-04 10:24:55.567055][2772.201535893], took 0.00min 2025-12-04T10:24:56.9947209Z Uploading artifacts took 1.42 seconds 2025-12-04T10:25:00.0357628Z Running cpp/cuda_allocator_test 1/1 ... [2025-12-04 10:25:00.035232][2776.669706635] 2025-12-04T10:25:00.0358136Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:25:00.0358515Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:25:00.0359044Z Finished cpp/cuda_allocator_test 1/1 ... [2025-12-04 10:25:00.035436][2776.669913819], took 0.00min 2025-12-04T10:25:00.8753975Z Running test batch 'tests to run' cost 5.31 seconds 2025-12-04T10:25:01.4576094Z + '[' ON == ON ']' 2025-12-04T10:25:01.4577167Z + valgrind --suppressions=/var/lib/jenkins/workspace/aten/tools/valgrind.sup --error-exitcode=1 build/bin/basic '--gtest_filter=-*CUDA' 2025-12-04T10:25:01.4687152Z ==38493== Memcheck, a memory error detector 2025-12-04T10:25:01.4687593Z ==38493== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. 2025-12-04T10:25:01.4688107Z ==38493== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info 2025-12-04T10:25:01.4688554Z ==38493== Command: build/bin/basic --gtest_filter=-*CUDA 2025-12-04T10:25:01.4688860Z ==38493== 2025-12-04T10:25:01.7866695Z ==38493== Warning: set address range perms: large range [0x4a9d000, 0x1a0d6000) (defined) 2025-12-04T10:25:01.7867429Z ==38493== Warning: set address range perms: large range [0x5ab8000, 0x16ce8000) (defined) 2025-12-04T10:25:04.4969121Z ==38493== Warning: set address range perms: large range [0x1a0d6000, 0x2d8f6000) (defined) 2025-12-04T10:25:07.8795595Z ==38493== Warning: set address range perms: large range [0x2f38a000, 0x4683f000) (noaccess) 2025-12-04T10:25:07.8796281Z ==38493== Warning: set address range perms: large range [0x2f400000, 0x466b5000) (defined) 2025-12-04T10:25:07.9219362Z ==38493== Warning: set address range perms: large range [0x466b5000, 0x57a7a000) (noaccess) 2025-12-04T10:25:07.9220053Z ==38493== Warning: set 
address range perms: large range [0x46800000, 0x579c5000) (defined) 2025-12-04T10:25:07.9516614Z ==38493== Warning: set address range perms: large range [0x477bd000, 0x579c5000) (defined) 2025-12-04T10:25:07.9630767Z ==38493== Warning: set address range perms: large range [0x59c92000, 0x73439000) (defined) 2025-12-04T10:25:07.9632433Z ==38493== Warning: set address range perms: large range [0x5ec47000, 0x73083000) (defined) 2025-12-04T10:25:08.7278145Z ==38493== Warning: set address range perms: large range [0x73439000, 0x89eab000) (defined) 2025-12-04T10:25:08.7278843Z ==38493== Warning: set address range perms: large range [0x735ef000, 0x89e54000) (defined) 2025-12-04T10:25:08.8449407Z ==38493== Warning: set address range perms: large range [0x99d0b000, 0xcc1fd000) (noaccess) 2025-12-04T10:25:08.8450026Z ==38493== Warning: set address range perms: large range [0x99e00000, 0xcc0f2000) (defined) 2025-12-04T10:25:54.5759841Z Running main() from /var/lib/jenkins/workspace/third_party/googletest/googletest/src/gtest_main.cc 2025-12-04T10:25:54.6006129Z Note: Google Test filter = -*CUDA 2025-12-04T10:25:54.6066218Z [==========] Running 6 tests from 1 test suite. 2025-12-04T10:25:54.6080127Z [----------] Global test environment set-up. 2025-12-04T10:25:54.6129871Z [----------] 6 tests from BasicTest 2025-12-04T10:25:54.6149948Z [ RUN ] BasicTest.BasicTestCPU 2025-12-04T10:25:54.8481884Z ==38493== Warning: noted but unhandled ioctl 0x30000001 with no size/direction hints. 2025-12-04T10:25:54.8482511Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:25:54.8483150Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:25:54.8487304Z ==38493== Warning: noted but unhandled ioctl 0x4b with no size/direction hints. 2025-12-04T10:25:54.8487837Z ==38493== This could cause spurious value errors to appear. 
2025-12-04T10:25:54.8488348Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:25:54.8493883Z ==38493== Warning: noted but unhandled ioctl 0x27 with no size/direction hints. 2025-12-04T10:25:54.8494414Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:25:54.8494895Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:25:54.9929947Z ==38493== Warning: noted but unhandled ioctl 0x25 with no size/direction hints. 2025-12-04T10:25:54.9930429Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:25:54.9930910Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:25:55.0934490Z ==38493== Warning: noted but unhandled ioctl 0x46 with no size/direction hints. 2025-12-04T10:25:55.0934981Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:25:55.0935697Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:25:55.0942466Z ==38493== Warning: noted but unhandled ioctl 0x17 with no size/direction hints. 2025-12-04T10:25:55.0942841Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:25:55.0943221Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 
2025-12-04T10:25:55.1036381Z ==38493== Warning: set address range perms: large range [0x200000000, 0x300200000) (noaccess) 2025-12-04T10:25:56.8410510Z 1100 ms 2025-12-04T10:25:57.1870989Z 56 ms 2025-12-04T10:25:57.3226420Z 70 ms 2025-12-04T10:26:00.1078045Z [ OK ] BasicTest.BasicTestCPU (5490 ms) 2025-12-04T10:26:00.1082575Z [ RUN ] BasicTest.BasicTestHalfCPU 2025-12-04T10:26:00.6186125Z 472 ms 2025-12-04T10:26:00.7865857Z 46 ms 2025-12-04T10:26:00.9358958Z 71 ms 2025-12-04T10:26:00.9866778Z [ OK ] BasicTest.BasicTestHalfCPU (878 ms) 2025-12-04T10:26:01.0023925Z [ RUN ] BasicTest.FactoryMethodsTest 2025-12-04T10:26:01.0903750Z ==38493== Warning: noted but unhandled ioctl 0x19 with no size/direction hints. 2025-12-04T10:26:01.0904333Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:26:01.0904841Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:26:01.1173591Z ==38493== Warning: noted but unhandled ioctl 0x49 with no size/direction hints. 2025-12-04T10:26:01.1174298Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:26:01.1174783Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:26:01.1182261Z ==38493== Warning: noted but unhandled ioctl 0x21 with no size/direction hints. 2025-12-04T10:26:01.1182893Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:26:01.1183364Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 2025-12-04T10:26:01.2408496Z ==38493== Warning: noted but unhandled ioctl 0x1b with no size/direction hints. 2025-12-04T10:26:01.2409285Z ==38493== This could cause spurious value errors to appear. 2025-12-04T10:26:01.2409773Z ==38493== See README_MISSING_SYSCALL_OR_IOCTL for guidance on writing a proper wrapper. 
2025-12-04T10:26:02.2310095Z [ OK ] BasicTest.FactoryMethodsTest (1228 ms) 2025-12-04T10:26:02.2310565Z [ RUN ] BasicTest.BasicStdTestCPU 2025-12-04T10:26:02.2769667Z Simple example: called once 2025-12-04T10:26:02.3559176Z throw: call_once will retry 2025-12-04T10:26:02.3572696Z throw: call_once will retry 2025-12-04T10:26:02.3575387Z throw: call_once will retry 2025-12-04T10:26:02.3580186Z Didn't throw, call_once will not attempt again 2025-12-04T10:26:02.3602882Z [ OK ] BasicTest.BasicStdTestCPU (129 ms) 2025-12-04T10:26:02.3603286Z [ RUN ] BasicTest.TestForBlobResizeCPU 2025-12-04T10:26:02.3792376Z [ OK ] BasicTest.TestForBlobResizeCPU (18 ms) 2025-12-04T10:26:02.3792866Z [ RUN ] BasicTest.TestForBlobStridesResizeCPU 2025-12-04T10:26:02.3838857Z [ OK ] BasicTest.TestForBlobStridesResizeCPU (4 ms) 2025-12-04T10:26:02.3856025Z [----------] 6 tests from BasicTest (7769 ms total) 2025-12-04T10:26:02.3856357Z 2025-12-04T10:26:02.3867712Z [----------] Global test environment tear-down 2025-12-04T10:26:02.3888302Z [==========] 6 tests from 1 test suite ran. (7791 ms total) 2025-12-04T10:26:02.3899926Z [ PASSED ] 6 tests. 
2025-12-04T10:26:08.4144189Z ==38493== 2025-12-04T10:26:08.4158518Z ==38493== HEAP SUMMARY: 2025-12-04T10:26:08.4158900Z ==38493== in use at exit: 22,414,310 bytes in 24,816 blocks 2025-12-04T10:26:08.4159368Z ==38493== total heap usage: 1,059,935 allocs, 1,035,119 frees, 280,366,278 bytes allocated 2025-12-04T10:26:08.4159771Z ==38493== 2025-12-04T10:26:09.9398596Z ==38493== LEAK SUMMARY: 2025-12-04T10:26:09.9398943Z ==38493== definitely lost: 288 bytes in 3 blocks 2025-12-04T10:26:09.9399293Z ==38493== indirectly lost: 192 bytes in 2 blocks 2025-12-04T10:26:09.9399915Z ==38493== possibly lost: 97,200 bytes in 186 blocks 2025-12-04T10:26:09.9400317Z ==38493== still reachable: 22,316,630 bytes in 24,625 blocks 2025-12-04T10:26:09.9400667Z ==38493== suppressed: 0 bytes in 0 blocks 2025-12-04T10:26:09.9401061Z ==38493== Rerun with --leak-check=full to see details of leaked memory 2025-12-04T10:26:09.9401409Z ==38493== 2025-12-04T10:26:09.9401703Z ==38493== For lists of detected and suppressed errors, rerun with: -s 2025-12-04T10:26:09.9402149Z ==38493== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4) 2025-12-04T10:26:10.1528538Z + [[ -x build/bin/tensor_interop_test ]] 2025-12-04T10:26:10.1530911Z + [[ -n '' ]] 2025-12-04T10:26:10.1531141Z + assert_git_not_dirty 2025-12-04T10:26:10.1531798Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *rocm* ]] 2025-12-04T10:26:10.1532228Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *xla* ]] 2025-12-04T10:26:10.1538688Z ++ git status --porcelain 2025-12-04T10:26:10.1539868Z ++ grep -v '?? 
third_party' 2025-12-04T10:26:10.5283531Z ++ true 2025-12-04T10:26:10.5285911Z + git_status= 2025-12-04T10:26:10.5286324Z + [[ -n '' ]] 2025-12-04T10:26:10.5286679Z + test_libtorch 1 2025-12-04T10:26:10.5286897Z + local SHARD=1 2025-12-04T10:26:10.5287112Z + [[ default != \s\l\o\w ]] 2025-12-04T10:26:10.5287359Z + echo 'Testing libtorch' 2025-12-04T10:26:10.5287578Z Testing libtorch 2025-12-04T10:26:10.5288268Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libbackend_with_compiler.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5317904Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libjitbackend_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5332648Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libcaffe2_nvrtc.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5346995Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libc10d_cuda_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5362467Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libshm_windows /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5377460Z + ln -sf /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so 
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_global_deps.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_nvshmem.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorch_python.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libtorchbind_test.so /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5392449Z + ln -sf '/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/lib/libnvfuser*' /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5406956Z + export CPP_TESTS_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5407594Z + CPP_TESTS_DIR=/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/bin 2025-12-04T10:26:10.5408000Z + [[ -z 1 ]] 2025-12-04T10:26:10.5408189Z + [[ 1 == \1 ]] 2025-12-04T10:26:10.5408391Z + test_libtorch_api 2025-12-04T10:26:10.5408678Z + MNIST_DIR=/var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T10:26:10.5409352Z + python tools/download_mnist.py --quiet -d /var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T10:26:10.5842674Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz ... 2025-12-04T10:26:10.9169947Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/train-labels-idx1-ubyte.gz ... 2025-12-04T10:26:10.9659065Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-images-idx3-ubyte.gz ... 2025-12-04T10:26:11.0610148Z Downloading https://ossci-datasets.s3.amazonaws.com/mnist/t10k-labels-idx1-ubyte.gz ... 
2025-12-04T10:26:11.1174464Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *asan* ]] 2025-12-04T10:26:11.1174931Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *slow-gradcheck* ]] 2025-12-04T10:26:11.1175310Z + OMP_NUM_THREADS=2 2025-12-04T10:26:11.1175843Z + TORCH_CPP_TEST_MNIST_PATH=/var/lib/jenkins/workspace/test/cpp/api/mnist 2025-12-04T10:26:11.1176394Z + python test/run_test.py --cpp --verbose -i cpp/test_api -k 'not IMethodTest' 2025-12-04T10:26:15.4052723Z Downloading https://ossci-metrics.s3.amazonaws.com/disabled-tests-condensed.json to /var/lib/jenkins/workspace/test/.pytorch-disabled-tests.json 2025-12-04T10:26:15.4131441Z Found test times from artifacts 2025-12-04T10:26:15.4450513Z Found test times from artifacts 2025-12-04T10:26:15.4459277Z Running all tests 2025-12-04T10:26:15.4462024Z Running parallel tests on 3 processes 2025-12-04T10:26:15.4462367Z Name: tests to run (est. time: 0.0min) 2025-12-04T10:26:15.4462625Z Serial tests (0): 2025-12-04T10:26:15.4462840Z Parallel tests (1): 2025-12-04T10:26:15.4463059Z cpp/test_api 1/1 2025-12-04T10:26:15.4463282Z Name: excluded (est. time: 0.0min) 2025-12-04T10:26:15.4463529Z Serial tests (0): 2025-12-04T10:26:15.4463730Z Parallel tests (0): 2025-12-04T10:26:15.4468542Z Running cpp/test_api 1/1 ... [2025-12-04 10:26:15.446646][2852.081124744] 2025-12-04T10:26:15.4469171Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:26:15.4469549Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:26:15.4470043Z Finished cpp/test_api 1/1 ... [2025-12-04 10:26:15.446800][2852.081280677], took 0.00min 2025-12-04T10:26:16.8090004Z Uploading artifacts took 1.35 seconds 2025-12-04T10:26:19.8138216Z Running cpp/test_api 1/1 ... [2025-12-04 10:26:19.813317][2856.447792587] 2025-12-04T10:26:19.8138924Z SCRIBE_GRAPHQL_ACCESS_TOKEN is set 2025-12-04T10:26:19.8139326Z Skipping C++ tests when running under RERUN_DISABLED_TESTS mode 2025-12-04T10:26:19.8139814Z Finished cpp/test_api 1/1 ... 
[2025-12-04 10:26:19.813482][2856.4479606], took 0.00min 2025-12-04T10:26:20.6392292Z Running test batch 'tests to run' cost 5.19 seconds 2025-12-04T10:26:21.2249346Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *android* ]] 2025-12-04T10:26:21.2250112Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *cuda* ]] 2025-12-04T10:26:21.2250467Z + [[ -z 1 ]] 2025-12-04T10:26:21.2250671Z + [[ 1 == \2 ]] 2025-12-04T10:26:21.2250875Z + assert_git_not_dirty 2025-12-04T10:26:21.2251165Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *rocm* ]] 2025-12-04T10:26:21.2251550Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug != *xla* ]] 2025-12-04T10:26:21.2256090Z ++ git status --porcelain 2025-12-04T10:26:21.2257477Z ++ grep -v '?? third_party' 2025-12-04T10:26:21.5954897Z ++ true 2025-12-04T10:26:21.5956470Z + git_status= 2025-12-04T10:26:21.5957207Z + [[ -n '' ]] 2025-12-04T10:26:21.5959232Z + [[ linux-jammy-cuda12.8-py3.10-gcc11-debug == *xpu* ]] 2025-12-04T10:26:21.5961283Z + sccache_epilogue 2025-12-04T10:26:21.5962106Z + echo '::group::Sccache Compilation Log' 2025-12-04T10:26:21.5962934Z ##[group]Sccache Compilation Log 2025-12-04T10:26:21.5963262Z + echo '=================== sccache compilation log ===================' 2025-12-04T10:26:21.5963644Z =================== sccache compilation log =================== 2025-12-04T10:26:21.5964217Z + python /var/lib/jenkins/workspace/.ci/pytorch/print_sccache_log.py /var/lib/jenkins/sccache_error.log 2025-12-04T10:26:21.6087938Z + echo '=========== If your build fails, please take a look at the log above for possible reasons ===========' 2025-12-04T10:26:21.6088593Z =========== If your build fails, please take a look at the log above for possible reasons =========== 2025-12-04T10:26:21.6089038Z + sccache --show-stats 2025-12-04T10:26:21.6123394Z Compile requests 252 2025-12-04T10:26:21.6123709Z Compile requests executed 118 2025-12-04T10:26:21.6124021Z Cache hits 74 2025-12-04T10:26:21.6124282Z Cache hits (C/C++) 74 
2025-12-04T10:26:21.6124534Z Cache misses 44 2025-12-04T10:26:21.6124784Z Cache misses (C/C++) 44 2025-12-04T10:26:21.6125052Z Cache hits rate 62.71 % 2025-12-04T10:26:21.6125322Z Cache hits rate (C/C++) 62.71 % 2025-12-04T10:26:21.6125575Z Cache timeouts 0 2025-12-04T10:26:21.6125990Z Cache read errors 0 2025-12-04T10:26:21.6126456Z Forced recaches 0 2025-12-04T10:26:21.6126791Z Cache write errors 0 2025-12-04T10:26:21.6127138Z Cache errors 0 2025-12-04T10:26:21.6127351Z Compilations 44 2025-12-04T10:26:21.6127558Z Compilation failures 0 2025-12-04T10:26:21.6127780Z Non-cacheable compilations 0 2025-12-04T10:26:21.6128000Z Non-cacheable calls 0 2025-12-04T10:26:21.6128215Z Non-compilation calls 134 2025-12-04T10:26:21.6128430Z Unsupported compiler calls 0 2025-12-04T10:26:21.6128650Z Average cache write 0.042 s 2025-12-04T10:26:21.6128876Z Average compiler 11.328 s 2025-12-04T10:26:21.6129092Z Average cache read hit 0.075 s 2025-12-04T10:26:21.6129316Z Failed distributed compilations 0 2025-12-04T10:26:21.6129664Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T10:26:21.6129991Z Version (client) 0.10.0 2025-12-04T10:26:21.6130216Z + sccache --stop-server 2025-12-04T10:26:21.6147726Z Stopping sccache server... 
2025-12-04T10:26:21.6150863Z Compile requests 252 2025-12-04T10:26:21.6151286Z Compile requests executed 118 2025-12-04T10:26:21.6151688Z Cache hits 74 2025-12-04T10:26:21.6151967Z Cache hits (C/C++) 74 2025-12-04T10:26:21.6152226Z Cache misses 44 2025-12-04T10:26:21.6152488Z Cache misses (C/C++) 44 2025-12-04T10:26:21.6152747Z Cache hits rate 62.71 % 2025-12-04T10:26:21.6153021Z Cache hits rate (C/C++) 62.71 % 2025-12-04T10:26:21.6153286Z Cache timeouts 0 2025-12-04T10:26:21.6153540Z Cache read errors 0 2025-12-04T10:26:21.6153800Z Forced recaches 0 2025-12-04T10:26:21.6154065Z Cache write errors 0 2025-12-04T10:26:21.6154319Z Cache errors 0 2025-12-04T10:26:21.6154731Z Compilations 44 2025-12-04T10:26:21.6155177Z Compilation failures 0 2025-12-04T10:26:21.6161254Z Non-cacheable compilations 0 2025-12-04T10:26:21.6161498Z Non-cacheable calls 0 2025-12-04T10:26:21.6161736Z Non-compilation calls 134 2025-12-04T10:26:21.6161968Z Unsupported compiler calls 0 2025-12-04T10:26:21.6162183Z Average cache write 0.042 s 2025-12-04T10:26:21.6162405Z Average compiler 11.328 s 2025-12-04T10:26:21.6162620Z Average cache read hit 0.075 s 2025-12-04T10:26:21.6162837Z Failed distributed compilations 0 2025-12-04T10:26:21.6163156Z Cache location s3, name: ossci-compiler-cache-circleci-v2, prefix: / 2025-12-04T10:26:21.6163495Z Version (client) 0.10.0 2025-12-04T10:26:21.6163710Z + echo ::endgroup:: 2025-12-04T10:26:21.6164191Z ##[endgroup] 2025-12-04T10:26:21.6164451Z + cleanup_workspace 2025-12-04T10:26:21.6165072Z + echo 'sudo may print the following warning message that can be ignored. The chown command will still run.' 2025-12-04T10:26:21.6165901Z sudo may print the following warning message that can be ignored. The chown command will still run. 
2025-12-04T10:26:21.6166580Z + echo ' sudo: setrlimit(RLIMIT_STACK): Operation not permitted' 2025-12-04T10:26:21.6167106Z sudo: setrlimit(RLIMIT_STACK): Operation not permitted 2025-12-04T10:26:21.6167571Z + echo 'For more details refer to https://github.com/sudo-project/sudo/issues/42' 2025-12-04T10:26:21.6168084Z For more details refer to https://github.com/sudo-project/sudo/issues/42 2025-12-04T10:26:21.6168433Z + sudo chown -R 1000 /var/lib/jenkins/workspace 2025-12-04T10:26:22.6039217Z ##[group]Run pytorch/test-infra/.github/actions/upload-benchmark-results@main 2025-12-04T10:26:22.6039591Z with: 2025-12-04T10:26:22.6039782Z benchmark-results-dir: test/test-reports 2025-12-04T10:26:22.6040009Z dry-run: false 2025-12-04T10:26:22.6040170Z schema-version: v3 2025-12-04T10:26:22.6040539Z github-token: *** 2025-12-04T10:26:22.6040844Z env: 2025-12-04T10:26:22.6040990Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:26:22.6041184Z HAS_NVIDIA_GPU: true 2025-12-04T10:26:22.6041418Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:26:22.6041817Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:26:22.6042178Z ##[endgroup] 2025-12-04T10:26:22.6058082Z ##[group]Run set -eux 2025-12-04T10:26:22.6058287Z set -eux 2025-12-04T10:26:22.6058436Z  2025-12-04T10:26:22.6058589Z if [[ -n "" ]]; then 2025-12-04T10:26:22.6058780Z  source "" 2025-12-04T10:26:22.6058970Z fi 2025-12-04T10:26:22.6059224Z python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T10:26:22.6059531Z  2025-12-04T10:26:22.6059686Z DEVICE_NAME="" 2025-12-04T10:26:22.6059870Z DEVICE_TYPE="" 2025-12-04T10:26:22.6060030Z  2025-12-04T10:26:22.6060198Z if command -v nvidia-smi; then 2025-12-04T10:26:22.6060617Z  # NB: I'm using PyTorch here to get the device name, however, it needs to 2025-12-04T10:26:22.6061018Z  # install the correct version of PyTorch manually for now. 
Any PyTorch 2025-12-04T10:26:22.6061385Z  # version is fine, I just use 2.7.1 to satify PYPIDEP linter 2025-12-04T10:26:22.6061691Z  python3 -mpip install torch==2.7.1 2025-12-04T10:26:22.6061930Z elif command -v rocminfo; then 2025-12-04T10:26:22.6062222Z  # NB: Installing torch on ROCm runner with pip here causes CI to fail 2025-12-04T10:26:22.6062612Z  # with a memoryview is too large error only on MI300 runners. Is pip 2025-12-04T10:26:22.6062993Z  # version on ROCm runner there too old? As a workaround, let's use the 2025-12-04T10:26:22.6063326Z  # GPU device name coming from rocminfo instead 2025-12-04T10:26:22.6063572Z  DEVICE_NAME=rocm 2025-12-04T10:26:22.6063915Z  DEVICE_TYPE=$(rocminfo | grep "Marketing Name" | tail -n1 | awk -F':' '{print $2}' | xargs) 2025-12-04T10:26:22.6064269Z fi 2025-12-04T10:26:22.6064416Z  2025-12-04T10:26:22.6064607Z echo "DEVICE_NAME=$DEVICE_NAME" >> $GITHUB_ENV 2025-12-04T10:26:22.6064890Z echo "DEVICE_TYPE=$DEVICE_TYPE" >> $GITHUB_ENV 2025-12-04T10:26:22.6076860Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:26:22.6077126Z env: 2025-12-04T10:26:22.6077292Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:26:22.6077480Z HAS_NVIDIA_GPU: true 2025-12-04T10:26:22.6077705Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:26:22.6078100Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:26:22.6078454Z ##[endgroup] 2025-12-04T10:26:22.6114988Z + [[ -n '' ]] 2025-12-04T10:26:22.6115321Z + python3 -mpip install boto3==1.35.33 psutil==7.0.0 pynvml==12.0.0 2025-12-04T10:26:22.8283944Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T10:26:23.8732603Z Collecting boto3==1.35.33 2025-12-04T10:26:23.8913850Z Downloading boto3-1.35.33-py3-none-any.whl (139 kB) 2025-12-04T10:26:24.1845291Z Collecting psutil==7.0.0 2025-12-04T10:26:24.1886325Z Downloading 
psutil-7.0.0-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (277 kB) 2025-12-04T10:26:24.2196084Z Collecting pynvml==12.0.0 2025-12-04T10:26:24.2230672Z Downloading pynvml-12.0.0-py3-none-any.whl (26 kB) 2025-12-04T10:26:24.2682807Z Collecting s3transfer<0.11.0,>=0.10.0 2025-12-04T10:26:24.2722868Z Downloading s3transfer-0.10.4-py3-none-any.whl (83 kB) 2025-12-04T10:26:25.3914934Z Collecting botocore<1.36.0,>=1.35.33 2025-12-04T10:26:25.3952330Z Downloading botocore-1.35.99-py3-none-any.whl (13.3 MB) 2025-12-04T10:26:25.5405711Z Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /usr/lib/python3.9/site-packages (from boto3==1.35.33) (0.10.0) 2025-12-04T10:26:25.5833060Z Collecting nvidia-ml-py<13.0.0a0,>=12.0.0 2025-12-04T10:26:25.5873781Z Downloading nvidia_ml_py-12.575.51-py3-none-any.whl (47 kB) 2025-12-04T10:26:25.5981824Z Requirement already satisfied: urllib3<1.27,>=1.25.4 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.25.10) 2025-12-04T10:26:25.5986883Z Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/lib/python3.9/site-packages (from botocore<1.36.0,>=1.35.33->boto3==1.35.33) (2.8.1) 2025-12-04T10:26:25.7768060Z Requirement already satisfied: six>=1.5 in /usr/lib/python3.9/site-packages (from python-dateutil<3.0.0,>=2.1->botocore<1.36.0,>=1.35.33->boto3==1.35.33) (1.15.0) 2025-12-04T10:26:25.8893676Z Installing collected packages: botocore, s3transfer, nvidia-ml-py, pynvml, psutil, boto3 2025-12-04T10:26:26.4207060Z Attempting uninstall: nvidia-ml-py 2025-12-04T10:26:26.4209801Z Found existing installation: nvidia-ml-py 11.525.84 2025-12-04T10:26:26.4222418Z Uninstalling nvidia-ml-py-11.525.84: 2025-12-04T10:26:26.4444261Z Successfully uninstalled nvidia-ml-py-11.525.84 2025-12-04T10:26:26.4991389Z Attempting uninstall: psutil 2025-12-04T10:26:26.4993396Z Found existing installation: psutil 5.9.8 2025-12-04T10:26:26.5070690Z Uninstalling 
psutil-5.9.8: 2025-12-04T10:26:26.5077378Z Successfully uninstalled psutil-5.9.8 2025-12-04T10:26:26.6589883Z Successfully installed boto3-1.35.33 botocore-1.35.99 nvidia-ml-py-12.575.51 psutil-7.0.0 pynvml-12.0.0 s3transfer-0.10.4 2025-12-04T10:26:26.7451447Z + DEVICE_NAME= 2025-12-04T10:26:26.7451731Z + DEVICE_TYPE= 2025-12-04T10:26:26.7451952Z + command -v nvidia-smi 2025-12-04T10:26:26.7452242Z + python3 -mpip install torch==2.7.1 2025-12-04T10:26:26.7452517Z /usr/bin/nvidia-smi 2025-12-04T10:26:26.9620077Z Defaulting to user installation because normal site-packages is not writeable 2025-12-04T10:26:27.2420473Z Collecting torch==2.7.1 2025-12-04T10:26:27.2585358Z Downloading torch-2.7.1-cp39-cp39-manylinux_2_28_x86_64.whl (821.1 MB) 2025-12-04T10:26:39.5083007Z Collecting nvidia-nvjitlink-cu12==12.6.85 2025-12-04T10:26:39.5168273Z Downloading nvidia_nvjitlink_cu12-12.6.85-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl (19.7 MB) 2025-12-04T10:26:39.7232437Z Collecting nvidia-cusparse-cu12==12.5.4.2 2025-12-04T10:26:39.7269781Z Downloading nvidia_cusparse_cu12-12.5.4.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (216.6 MB) 2025-12-04T10:26:42.3845376Z Collecting nvidia-cufft-cu12==11.3.0.4 2025-12-04T10:26:42.3923194Z Downloading nvidia_cufft_cu12-11.3.0.4-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (200.2 MB) 2025-12-04T10:26:44.7357477Z Collecting networkx 2025-12-04T10:26:44.7405102Z Downloading networkx-3.2.1-py3-none-any.whl (1.6 MB) 2025-12-04T10:26:44.7901555Z Collecting nvidia-curand-cu12==10.3.7.77 2025-12-04T10:26:44.7973495Z Downloading nvidia_curand_cu12-10.3.7.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (56.3 MB) 2025-12-04T10:26:45.3994736Z Collecting nvidia-nccl-cu12==2.26.2 2025-12-04T10:26:45.4037506Z Downloading nvidia_nccl_cu12-2.26.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (201.3 MB) 2025-12-04T10:26:47.8352169Z Collecting filelock 2025-12-04T10:26:47.8392631Z Downloading 
filelock-3.19.1-py3-none-any.whl (15 kB) 2025-12-04T10:26:47.8847826Z Collecting sympy>=1.13.3 2025-12-04T10:26:47.8897208Z Downloading sympy-1.14.0-py3-none-any.whl (6.3 MB) 2025-12-04T10:26:47.9671696Z Collecting nvidia-cufile-cu12==1.11.1.6 2025-12-04T10:26:47.9744005Z Downloading nvidia_cufile_cu12-1.11.1.6-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (1.1 MB) 2025-12-04T10:26:48.0168956Z Collecting nvidia-cuda-runtime-cu12==12.6.77 2025-12-04T10:26:48.0240533Z Downloading nvidia_cuda_runtime_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (897 kB) 2025-12-04T10:26:48.0947963Z Collecting fsspec 2025-12-04T10:26:48.0988675Z Downloading fsspec-2025.10.0-py3-none-any.whl (200 kB) 2025-12-04T10:26:48.1330866Z Collecting nvidia-nvtx-cu12==12.6.77 2025-12-04T10:26:48.1368897Z Downloading nvidia_nvtx_cu12-12.6.77-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (89 kB) 2025-12-04T10:26:48.1718962Z Collecting nvidia-cuda-nvrtc-cu12==12.6.77 2025-12-04T10:26:48.1786951Z Downloading nvidia_cuda_nvrtc_cu12-12.6.77-py3-none-manylinux2014_x86_64.whl (23.7 MB) 2025-12-04T10:26:48.4110714Z Collecting nvidia-cuda-cupti-cu12==12.6.80 2025-12-04T10:26:48.4176907Z Downloading nvidia_cuda_cupti_cu12-12.6.80-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (8.9 MB) 2025-12-04T10:26:48.5325868Z Collecting triton==3.3.1 2025-12-04T10:26:48.5402682Z Downloading triton-3.3.1-cp39-cp39-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (155.6 MB) 2025-12-04T10:26:50.1569808Z Collecting nvidia-cudnn-cu12==9.5.1.17 2025-12-04T10:26:50.1652559Z Downloading nvidia_cudnn_cu12-9.5.1.17-py3-none-manylinux_2_28_x86_64.whl (571.0 MB) 2025-12-04T10:26:58.5535635Z Collecting nvidia-cusolver-cu12==11.7.1.2 2025-12-04T10:26:58.5606714Z Downloading nvidia_cusolver_cu12-11.7.1.2-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (158.2 MB) 2025-12-04T10:27:00.2654826Z Collecting nvidia-cusparselt-cu12==0.6.3 2025-12-04T10:27:00.2693990Z Downloading 
nvidia_cusparselt_cu12-0.6.3-py3-none-manylinux2014_x86_64.whl (156.8 MB) 2025-12-04T10:27:01.7501383Z Requirement already satisfied: typing-extensions>=4.10.0 in /home/ec2-user/.local/lib/python3.9/site-packages (from torch==2.7.1) (4.15.0) 2025-12-04T10:27:01.7502903Z Requirement already satisfied: jinja2 in /usr/lib/python3.9/site-packages (from torch==2.7.1) (2.11.3) 2025-12-04T10:27:01.7791947Z Collecting nvidia-cublas-cu12==12.6.4.1 2025-12-04T10:27:01.7862392Z Downloading nvidia_cublas_cu12-12.6.4.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (393.1 MB) 2025-12-04T10:27:07.0205216Z Requirement already satisfied: setuptools>=40.8.0 in /usr/lib/python3.9/site-packages (from triton==3.3.1->torch==2.7.1) (59.6.0) 2025-12-04T10:27:07.0500280Z Collecting mpmath<1.4,>=1.1.0 2025-12-04T10:27:07.0541759Z Downloading mpmath-1.3.0-py3-none-any.whl (536 kB) 2025-12-04T10:27:07.1363919Z Requirement already satisfied: MarkupSafe>=0.23 in /usr/lib64/python3.9/site-packages (from jinja2->torch==2.7.1) (1.1.1) 2025-12-04T10:27:07.4391658Z Installing collected packages: nvidia-nvjitlink-cu12, nvidia-cusparse-cu12, nvidia-cublas-cu12, mpmath, triton, sympy, nvidia-nvtx-cu12, nvidia-nccl-cu12, nvidia-cusparselt-cu12, nvidia-cusolver-cu12, nvidia-curand-cu12, nvidia-cufile-cu12, nvidia-cufft-cu12, nvidia-cudnn-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, networkx, fsspec, filelock, torch 2025-12-04T10:27:15.1866603Z WARNING: The scripts proton and proton-viewer are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T10:27:15.1867543Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T10:27:18.6889012Z WARNING: The script isympy is installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T10:27:18.6889820Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 
2025-12-04T10:27:45.2853731Z WARNING: The scripts torchfrtrace and torchrun are installed in '/home/ec2-user/.local/bin' which is not on PATH. 2025-12-04T10:27:45.2854623Z Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location. 2025-12-04T10:27:45.4719790Z Successfully installed filelock-3.19.1 fsspec-2025.10.0 mpmath-1.3.0 networkx-3.2.1 nvidia-cublas-cu12-12.6.4.1 nvidia-cuda-cupti-cu12-12.6.80 nvidia-cuda-nvrtc-cu12-12.6.77 nvidia-cuda-runtime-cu12-12.6.77 nvidia-cudnn-cu12-9.5.1.17 nvidia-cufft-cu12-11.3.0.4 nvidia-cufile-cu12-1.11.1.6 nvidia-curand-cu12-10.3.7.77 nvidia-cusolver-cu12-11.7.1.2 nvidia-cusparse-cu12-12.5.4.2 nvidia-cusparselt-cu12-0.6.3 nvidia-nccl-cu12-2.26.2 nvidia-nvjitlink-cu12-12.6.85 nvidia-nvtx-cu12-12.6.77 sympy-1.14.0 torch-2.7.1 triton-3.3.1 2025-12-04T10:27:45.9759037Z + echo DEVICE_NAME= 2025-12-04T10:27:45.9759377Z + echo DEVICE_TYPE= 2025-12-04T10:27:45.9789758Z ##[group]Run set -eux 2025-12-04T10:27:45.9789958Z set -eux 2025-12-04T10:27:45.9790120Z  2025-12-04T10:27:45.9790296Z if [[ -z "${GITHUB_TOKEN}" ]]; then 2025-12-04T10:27:45.9790549Z  echo "Missing github-token input" 2025-12-04T10:27:45.9790762Z  exit 1 2025-12-04T10:27:45.9790914Z fi 2025-12-04T10:27:45.9800122Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:45.9800397Z env: 2025-12-04T10:27:45.9800554Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:45.9800753Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:45.9800977Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:45.9801373Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:45.9801736Z DEVICE_NAME: 2025-12-04T10:27:45.9801894Z DEVICE_TYPE: 2025-12-04T10:27:45.9802257Z GITHUB_TOKEN: *** 2025-12-04T10:27:45.9802589Z ##[endgroup] 2025-12-04T10:27:45.9936494Z + [[ -z *** ]] 2025-12-04T10:27:46.0020037Z ##[group]Run pytorch/test-infra/.github/actions/get-workflow-job-id@main 
2025-12-04T10:27:46.0020340Z with: 2025-12-04T10:27:46.0020623Z github-token: *** 2025-12-04T10:27:46.0020789Z env: 2025-12-04T10:27:46.0020941Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:46.0021131Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:46.0021352Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:46.0021753Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:46.0022115Z DEVICE_NAME: 2025-12-04T10:27:46.0022292Z DEVICE_TYPE: 2025-12-04T10:27:46.0022444Z ##[endgroup] 2025-12-04T10:27:46.0113364Z ##[group]Run set -eux 2025-12-04T10:27:46.0113556Z set -eux 2025-12-04T10:27:46.0113715Z  2025-12-04T10:27:46.0114073Z python3 "${GITHUB_ACTION_PATH}/../../scripts/get_workflow_job_id.py" "${GITHUB_RUN_ID}" "${RUNNER_NAME}" 2025-12-04T10:27:46.0121400Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:46.0121671Z env: 2025-12-04T10:27:46.0121825Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:46.0122015Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:46.0122231Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:46.0122631Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:46.0122983Z DEVICE_NAME: 2025-12-04T10:27:46.0123142Z DEVICE_TYPE: 2025-12-04T10:27:46.0123414Z GITHUB_TOKEN: *** 2025-12-04T10:27:46.0123583Z ##[endgroup] 2025-12-04T10:27:46.0150758Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/get-workflow-job-id/../../scripts/get_workflow_job_id.py 19922826259 i-07df7d64debf86ede 2025-12-04T10:27:48.6109947Z setting job-id=57120265563 2025-12-04T10:27:48.6110704Z setting job-name=linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T10:27:48.6234293Z ##[group]Run set -eux 2025-12-04T10:27:48.6234492Z set -eux 2025-12-04T10:27:48.6234650Z  2025-12-04T10:27:48.6234806Z if [[ -n "" ]]; then 2025-12-04T10:27:48.6234995Z  source "" 2025-12-04T10:27:48.6235152Z fi 2025-12-04T10:27:48.6235299Z  2025-12-04T10:27:48.6235575Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_metadata.py" \ 2025-12-04T10:27:48.6235944Z  --schema-version "${SCHEMA_VERSION}" \ 2025-12-04T10:27:48.6236392Z  --repo "${REPO}" \ 2025-12-04T10:27:48.6236621Z  --head-branch "${HEAD_BRANCH}" \ 2025-12-04T10:27:48.6236850Z  --head-sha "${HEAD_SHA}" \ 2025-12-04T10:27:48.6237085Z  --workflow-id "${WORKFLOW_RUN_ID}" \ 2025-12-04T10:27:48.6237332Z  --run-attempt "${RUN_ATTEMPT}" \ 2025-12-04T10:27:48.6237552Z  --job-id "${JOB_ID}" \ 2025-12-04T10:27:48.6237851Z  --job-name "${JOB_NAME}" 2025-12-04T10:27:48.6245587Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:48.6245865Z env: 2025-12-04T10:27:48.6246022Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:48.6246205Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:48.6246430Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:48.6246825Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:48.6247198Z DEVICE_NAME: 2025-12-04T10:27:48.6247357Z DEVICE_TYPE: 2025-12-04T10:27:48.6247512Z SCHEMA_VERSION: v3 2025-12-04T10:27:48.6247697Z REPO: pytorch/pytorch 2025-12-04T10:27:48.6247878Z HEAD_BRANCH: refs/heads/main 2025-12-04T10:27:48.6248108Z HEAD_SHA: ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 2025-12-04T10:27:48.6248369Z WORKFLOW_RUN_ID: 19922826259 2025-12-04T10:27:48.6248546Z RUN_ATTEMPT: 1 2025-12-04T10:27:48.6248707Z JOB_ID: 57120265563 2025-12-04T10:27:48.6249202Z JOB_NAME: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T10:27:48.6249820Z ##[endgroup] 2025-12-04T10:27:48.6279682Z + [[ -n '' ]] 2025-12-04T10:27:48.6281352Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_metadata.py --schema-version v3 --repo pytorch/pytorch --head-branch refs/heads/main --head-sha ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32 --workflow-id 19922826259 --run-attempt 1 --job-id 57120265563 --job-name 'linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled...' 2025-12-04T10:27:48.6644166Z ##[group]Run set -eux 2025-12-04T10:27:48.6644358Z set -eux 2025-12-04T10:27:48.6644507Z  2025-12-04T10:27:48.6644667Z if [[ -n "" ]]; then 2025-12-04T10:27:48.6644871Z  source "" 2025-12-04T10:27:48.6645032Z fi 2025-12-04T10:27:48.6645178Z  2025-12-04T10:27:48.6645466Z python3 "${GITHUB_ACTION_PATH}/../../scripts/benchmarks/gather_runners_info.py" 2025-12-04T10:27:48.6652731Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:48.6653011Z env: 2025-12-04T10:27:48.6653170Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:48.6653351Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:48.6653582Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:48.6653992Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:48.6654374Z DEVICE_NAME: 2025-12-04T10:27:48.6654531Z DEVICE_TYPE: 2025-12-04T10:27:48.6654688Z ##[endgroup] 2025-12-04T10:27:48.6681751Z + [[ -n '' ]] 2025-12-04T10:27:48.6682906Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/benchmarks/gather_runners_info.py 2025-12-04T10:27:49.5171001Z /home/ec2-user/.local/lib/python3.9/site-packages/torch/_subclasses/functional_tensor.py:276: UserWarning: Failed to initialize NumPy: No module named 'numpy' (Triggered 
internally at /pytorch/torch/csrc/utils/tensor_numpy.cpp:81.) 2025-12-04T10:27:49.5172617Z cpu = _conversion_method_template(device=torch.device("cpu")) 2025-12-04T10:27:50.4323849Z ##[group]Run set -eux 2025-12-04T10:27:50.4324049Z set -eux 2025-12-04T10:27:50.4324209Z  2025-12-04T10:27:50.4324393Z # TODO (huydhn): Implement this part 2025-12-04T10:27:50.4324671Z echo "dependencies={}" >> "${GITHUB_OUTPUT}" 2025-12-04T10:27:50.4332770Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:50.4333071Z env: 2025-12-04T10:27:50.4333226Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:50.4333434Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:50.4333660Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:50.4334056Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:50.4334497Z DEVICE_NAME: 2025-12-04T10:27:50.4334665Z DEVICE_TYPE: 2025-12-04T10:27:50.4334818Z ##[endgroup] 2025-12-04T10:27:50.4364001Z + echo 'dependencies={}' 2025-12-04T10:27:50.4430816Z ##[group]Run set -eux 2025-12-04T10:27:50.4431030Z set -eux 2025-12-04T10:27:50.4431182Z  2025-12-04T10:27:50.4431340Z if [[ -n "" ]]; then 2025-12-04T10:27:50.4431536Z  source "" 2025-12-04T10:27:50.4431703Z fi 2025-12-04T10:27:50.4431854Z  2025-12-04T10:27:50.4432061Z if [[ ! 
-d "${BENCHMARK_RESULTS_DIR}" ]]; then 2025-12-04T10:27:50.4432371Z  echo "${BENCHMARK_RESULTS_DIR} does not exist, skipping" 2025-12-04T10:27:50.4432721Z  # We don't want the job to fail if the directory doesn't exist 2025-12-04T10:27:50.4432999Z  exit 0 2025-12-04T10:27:50.4433160Z fi 2025-12-04T10:27:50.4433312Z  2025-12-04T10:27:50.4433592Z if [[ "${DRY_RUN}" == "true" ]]; then 2025-12-04T10:27:50.4433932Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T10:27:50.4434320Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T10:27:50.4434623Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T10:27:50.4434876Z  --runners "${RUNNER_INFO}" \ 2025-12-04T10:27:50.4435114Z  --dependencies "${DEPENDENCIES}" \ 2025-12-04T10:27:50.4435334Z  --dry-run 2025-12-04T10:27:50.4435503Z else 2025-12-04T10:27:50.4435774Z  python3 "${GITHUB_ACTION_PATH}/../../scripts/upload_benchmark_results.py" \ 2025-12-04T10:27:50.4436147Z  --benchmark-results-dir "${BENCHMARK_RESULTS_DIR}" \ 2025-12-04T10:27:50.4436436Z  --metadata "${BENCHMARK_METADATA}" \ 2025-12-04T10:27:50.4436674Z  --runners "${RUNNER_INFO}" \ 2025-12-04T10:27:50.4436899Z  --dependencies "${DEPENDENCIES}" 2025-12-04T10:27:50.4437117Z fi 2025-12-04T10:27:50.4444184Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:50.4444456Z env: 2025-12-04T10:27:50.4444606Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:50.4444806Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:50.4445035Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:50.4445431Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:50.4445791Z DEVICE_NAME: 2025-12-04T10:27:50.4445952Z DEVICE_TYPE: 2025-12-04T10:27:50.4446133Z BENCHMARK_RESULTS_DIR: test/test-reports 2025-12-04T10:27:50.4446368Z DRY_RUN: false 2025-12-04T10:27:50.4447485Z BENCHMARK_METADATA: {"timestamp": 1764844068, "schema_version": "v3", "name": 
"linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled...", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922826259, "run_attempt": 1, "job_id": 57120265563} 2025-12-04T10:27:50.4448975Z RUNNER_INFO: [{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 60, "extra_info": {"hostname": "ip-10-0-6-74.ec2.internal"}, "name": "cuda", "type": "NVIDIA L4", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}] 2025-12-04T10:27:50.4449521Z DEPENDENCIES: {} 2025-12-04T10:27:50.4449680Z ##[endgroup] 2025-12-04T10:27:50.4474992Z + [[ -n '' ]] 2025-12-04T10:27:50.4475244Z + [[ ! -d test/test-reports ]] 2025-12-04T10:27:50.4475505Z + [[ false == \t\r\u\e ]] 2025-12-04T10:27:50.4478544Z + python3 /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py --benchmark-results-dir test/test-reports --metadata '{"timestamp": 1764844068, "schema_version": "v3", "name": "linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled...", "repo": "pytorch/pytorch", "head_branch": "refs/heads/main", "head_sha": "ffd9b0fb4355e97af82fc42cf185c3ffa0fc0a32", "workflow_id": 19922826259, "run_attempt": 1, "job_id": 57120265563}' --runners '[{"cpu_info": "x86_64", "cpu_count": 16, "avail_mem_in_gb": 60, "extra_info": {"hostname": "ip-10-0-6-74.ec2.internal"}, "name": "cuda", "type": "NVIDIA L4", "gpu_count": 1, "avail_gpu_mem_in_gb": 22}]' --dependencies '{}' 2025-12-04T10:27:50.6344968Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'lazy/test_ts_opinfo'}], 'excluded': []} from 
test/test-reports/td_exclusions-e77b946010afe336c823.json is not a benchmark record, skipping 2025-12-04T10:27:50.6346484Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6402157Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'inductor/test_aot_inductor'}, {'test_file': 'inductor/test_torchinductor'}, {'test_file': 'inductor/test_torchinductor_dynamic_shapes'}, {'test_file': 'inductor/test_torchinductor_codegen_dynamic_shapes'}, {'test_file': 'inductor/test_kernel_benchmark'}, {'test_file': 'inductor/test_torchinductor_opinfo'}, {'test_file': 'inductor/test_pattern_matcher'}, {'test_file': 'inductor/test_cuda_repro'}, {'test_file': 'inductor/test_cudagraph_trees'}, {'test_file': 'dynamo/test_activation_checkpointing'}, {'test_file': 'dynamo/test_logging'}, {'test_file': 'dynamo/test_repros'}, {'test_file': 'inductor/test_flex_attention'}, {'test_file': 'inductor/test_cuda_select_algorithm'}, {'test_file': 'inductor/test_compile_subprocess'}, {'test_file': 'inductor/test_flex_decoding'}, {'test_file': 'inductor/test_deterministic'}, {'test_file': 'export/test_retraceability'}, {'test_file': 'inductor/test_fp8'}, {'test_file': 'dynamo/test_model_output'}, {'test_file': 'inductor/test_triton_kernels'}, {'test_file': 'inductor/test_extension_backend'}, {'test_file': 'inductor/test_native_matmul'}, {'test_file': 'inductor/test_loop_ordering'}, {'test_file': 'export/test_serdes'}, {'test_file': 'dynamo/test_regional_inductor'}, {'test_file': 'dynamo/test_fx_graph_runnable'}, {'test_file': 'dynamo/test_backends'}, {'test_file': 'inductor/test_aot_inductor_package'}, {'test_file': 'inductor/test_decompose_mem_bound_mm'}, {'test_file': 'inductor/test_op_dtype_prop'}, {'test_file': 'inductor/test_online_softmax'}, {'test_file': 'inductor/test_memory'}, {'test_file': 
'dynamo/test_streams'}, {'test_file': 'inductor/test_unbacked_symints'}, {'test_file': 'inductor/test_scatter_optimization'}, {'test_file': 'inductor/test_mix_order_reduction'}, {'test_file': 'inductor/test_padding'}, {'test_file': 'dynamo/test_aot_compile'}, {'test_file': 'dynamo/test_sets'}, {'test_file': 'dynamo/test_wrap_inductor_compiled_regions'}, {'test_file': 'dynamo/test_callback'}, {'test_file': 'dynamo/test_compiler_bisector'}, {'test_file': 'inductor/test_custom_op_autotune'}, {'test_file': 'inductor/test_cudagraph_trees_expandable_segments'}, {'test_file': 'dynamo/test_decorators'}, {'test_file': 'test_privateuseone_python_backend'}, {'test_file': 'inductor/test_collective_autotuning'}, {'test_file': 'test_varlen_attention'}, {'test_file': 'test_cuda'}, {'test_file': 'test_transformers'}, {'test_file': 'test_matmul_cuda'}, {'test_file': 'test_autograd'}, {'test_file': 'test_sparse'}, {'test_file': 'higher_order_ops/test_local_map'}, {'test_file': 'test_dataloader'}, {'test_file': 'higher_order_ops/test_invoke_subgraph'}, {'test_file': 'test_decomp'}, {'test_file': 'test_ci_sanity_check_fail'}, {'test_file': 'test_ops_fwd_gradients'}, {'test_file': 'test_meta'}, {'test_file': 'test_ops_jit'}, {'test_file': 'test_ops_gradients'}, {'test_file': 'test_nestedtensor'}, {'test_file': 'test_linalg'}, {'test_file': 'test_cuda_expandable_segments'}, {'test_file': 'test_public_bindings'}, {'test_file': 'test_ops'}, {'test_file': 'functorch/test_dims'}, {'test_file': 'test_sparse_csr'}, {'test_file': 'functorch/test_ops'}, {'test_file': 'functorch/test_vmap'}, {'test_file': 'test_overrides'}, {'test_file': 'test_torchfuzz_repros'}, {'test_file': 'inductor/test_max_autotune'}, {'test_file': 'doctests'}, {'test_file': 'inductor/test_select_algorithm'}, {'test_file': 'inductor/test_group_batch_fusion'}, {'test_file': 'dynamo/test_dynamic_shapes'}, {'test_file': 'inductor/test_cpu_repro'}, {'test_file': 'inductor/test_smoke'}, {'test_file': 'dynamo/test_after_aot'}, 
{'test_file': 'inductor/test_snode_runtime'}, {'test_file': 'inductor/test_minifier'}, {'test_file': 'inductor/test_compiled_autograd'}, {'test_file': 'inductor/test_custom_lowering'}, {'test_file': 'inductor/test_perf'}, {'test_file': 'inductor/test_fused_attention'}, {'test_file': 'inductor/test_binary_folding'}, {'test_file': 'inductor/test_mkldnn_pattern_matcher'}, {'test_file': 'inductor/test_inductor_freezing'}, {'test_file': 'inductor/test_layout_optim'}, {'test_file': 'dynamo/test_unspec'}, {'test_file': 'dynamo/test_higher_order_ops'}, {'test_file': 'inductor/test_mmdecomp'}, {'test_file': 'dynamo/test_ctx_manager'}, {'test_file': 'dynamo/test_exc'}, {'test_file': 'dynamo/test_misc'}, {'test_file': 'inductor/test_cpu_select_algorithm'}, {'test_file': 'inductor/test_aot_inductor_arrayref'}, {'test_file': 'inductor/test_cpu_cpp_wrapper'}, {'test_file': 'inductor/test_cutlass_backend'}, {'test_file': 'inductor/test_triton_cpu_backend'}, {'test_file': 'inductor/test_torchinductor_strided_blocks'}, {'test_file': 'test_custom_ops'}, {'test_file': 'test_content_store'}, {'test_file': 'inductor/test_halide'}, {'test_file': 'inductor/test_multi_kernel'}, {'test_file': 'inductor/test_analysis'}, {'test_file': 'inductor/test_pad_mm'}, {'test_file': 'inductor/test_triton_syntax'}, {'test_file': 'inductor/test_triton_extension_backend'}, {'test_file': 'test_sparse_semi_structured'}, {'test_file': 'inductor/test_op_completeness'}, {'test_file': 'inductor/test_subgraph_choice'}, {'test_file': 'inductor/test_b2b_gemm'}, {'test_file': 'inductor/test_triton_heuristics'}, {'test_file': 'inductor/test_cutedsl_grouped_mm'}, {'test_file': 'inductor/test_cpp_wrapper_hipify'}, {'test_file': 'inductor/test_ck_backend'}, {'test_file': 'inductor/test_inductor_utils'}, {'test_file': 'inductor/test_template_heuristics_registry'}, {'test_file': 'inductor/test_async_compile'}, {'test_file': 'inductor/test_gpu_cpp_wrapper'}, {'test_file': 'export/test_export_training_ir_to_run_decomp'}, 
{'test_file': 'dynamo/test_deque_reconstruct'}, {'test_file': 'inductor/test_utils'}, {'test_file': 'inductor/test_indexing'}, {'test_file': 'inductor/test_inductor_annotations'}, {'test_file': 'inductor/test_compile_worker'}, {'test_file': 'dynamo/test_einops'}, {'test_file': 'inductor/test_external_callables'}, {'test_file': 'test_testing'}, {'test_file': 'dynamo/test_fx_passes_pre_grad'}, {'test_file': 'inductor/test_autoheuristic'}, {'test_file': 'export/test_strict_export_v2'}, {'test_file': 'inductor/test_flex_flash'}, {'test_file': 'inductor/test_segmented_tree'}, {'test_file': 'inductor/test_kernel_optimization'}, {'test_file': 'inductor/test_metrics'}, {'test_file': 'export/test_unflatten_training_ir'}, {'test_file': 'inductor/test_fx_fusion'}, {'test_file': 'inductor/test_xpu_basic'}, {'test_file': 'dynamo/test_inline_and_install'}, {'test_file': 'export/test_functionalized_assertions'}, {'test_file': 'inductor/test_selective_lowering'}, {'test_file': 'dynamo/test_base_output'}, {'test_file': 'inductor/test_lookup_table'}, {'test_file': 'inductor/test_cooperative_reductions'}, {'test_file': 'export/test_serialize'}, {'test_file': 'inductor/test_cutedsl_template'}, {'test_file': 'inductor/test_benchmark_fusion'}, {'test_file': 'inductor/test_inductor_scheduler'}, {'test_file': 'inductor/test_move_constructors_to_gpu'}, {'test_file': 'export/test_export_strict'}, {'test_file': 'dynamo/test_modules'}, {'test_file': 'inductor/test_remote_cache'}, {'test_file': 'inductor/test_coordinate_descent_tuner'}, {'test_file': 'inductor/test_inplace_padding'}, {'test_file': 'inductor/test_cudacodecache'}, {'test_file': 'inductor/test_minifier_utils'}, {'test_file': 'inductor/test_debug_trace'}, {'test_file': 'dynamo/test_recompiles'}, {'test_file': 'inductor/test_foreach'}, {'test_file': 'export/test_tree_utils'}, {'test_file': 'inductor/test_triton_wrapper'}, {'test_file': 'inductor/test_static_cuda_launcher'}, {'test_file': 'export/test_dynamic_shapes'}, {'test_file': 
'dynamo/test_sdpa'}, {'test_file': 'dynamo/test_utils'}, {'test_file': 'inductor/test_provenance_tracing'}, {'test_file': 'inductor/test_combo_kernels'}, {'test_file': 'inductor/test_codegen_triton'}, {'test_file': 'dynamo/test_frame_init'}, {'test_file': 'inductor/test_device_assert'}, {'test_file': 'dynamo/test_skip_non_tensor'}, {'test_file': 'dynamo/test_skip_guard_eval_unsafe'}, {'test_file': 'dynamo/test_interop'}, {'test_file': 'functorch/test_eager_transforms'}, {'test_file': 'inductor/test_control_deps'}, {'test_file': 'inductor/test_benchmarking'}, {'test_file': 'inductor/test_helion_kernels'}, {'test_file': 'inductor/test_quantization'}, {'test_file': 'inductor/test_best_config'}, {'test_file': 'export/test_tools'}, {'test_file': 'inductor/test_compiled_optimizers'}, {'test_file': 'dynamo/test_buffers_override'}, {'test_file': 'inductor/test_inplacing_pass'}, {'test_file': 'inductor/test_aot_inductor_custom_ops'}, {'test_file': 'inductor/test_split_cat_fx_passes'}, {'test_file': 'inductor/test_profiler'}, {'test_file': 'inductor/test_memory_planning'}, {'test_file': 'inductor/test_mem_estimation'}, {'test_file': 'dynamo/test_view'}, {'test_file': 'inductor/test_cutlass_evt'}, {'test_file': 'dynamo/test_reconstruct'}, {'test_file': 'dynamo/test_aot_autograd'}, {'test_file': 'export/test_cpp_serdes'}, {'test_file': 'inductor/test_cache'}, {'test_file': 'inductor/test_block_analysis'}, {'test_file': 'dynamo/test_subgraphs'}, {'test_file': 'dynamo/test_pre_dispatch'}, {'test_file': 'inductor/test_custom_post_grad_passes'}, {'test_file': 'dynamo/test_fx_annotate'}, {'test_file': 'dynamo/test_pgo'}, {'test_file': 'dynamo/test_config'}, {'test_file': 'dynamo/test_metrics_context'}, {'test_file': 'export/test_package'}, {'test_file': 'export/test_export_opinfo'}, {'test_file': 'dynamo/test_nops'}, {'test_file': 'inductor/test_graph_transform_observer'}, {'test_file': 'inductor/test_aot_inductor_utils'}, {'test_file': 'export/test_db'}, {'test_file': 
'dynamo/test_export_mutations'}, {'test_file': 'inductor/test_config'}, {'test_file': 'inductor/test_dependencies'}, {'test_file': 'inductor/test_fuzzer'}, {'test_file': 'dynamo/test_global'}, {'test_file': 'inductor/test_control_flow'}, {'test_file': 'dynamo/test_graph_region_tracker'}, {'test_file': 'dynamo/test_unittest'}, {'test_file': 'inductor/test_compile'}, {'test_file': 'dynamo/test_functions'}, {'test_file': 'inductor/test_ordered_set'}, {'test_file': 'inductor/test_pallas'}, {'test_file': 'dynamo/test_install_free_tensors'}, {'test_file': 'inductor/test_torchinductor_codegen_config_overrides'}, {'test_file': 'export/test_passes'}, {'test_file': 'dynamo/test_autograd_function'}, {'test_file': 'inductor/test_codecache'}, {'test_file': 'dynamo/test_cudagraphs'}, {'test_file': 'inductor/test_alignment'}, {'test_file': 'dynamo/test_profiler'}, {'test_file': 'dynamo/test_guard_serialization'}, {'test_file': 'dynamo/test_compile'}, {'test_file': 'dynamo/test_nested_graph_breaks'}, {'test_file': 'dynamo/test_dicts'}, {'test_file': 'inductor/test_needs_exact_strides'}, {'test_file': 'inductor/test_auto_functionalize'}, {'test_file': 'inductor/test_split_cat_fx_aten_passes'}, {'test_file': 'inductor/test_minifier_isolate'}, {'test_file': 'dynamo/test_list'}, {'test_file': 'dynamo/test_resume'}, {'test_file': 'inductor/test_augmented_graph_helper'}, {'test_file': 'dynamo/test_deviceguard'}, {'test_file': 'dynamo/test_sources'}, {'test_file': 'dynamo/test_backward_higher_order_ops'}, {'test_file': 'dynamo/test_modes'}, {'test_file': 'dynamo/test_optimizers'}, {'test_file': 'export/test_torchbind'}, {'test_file': 'inductor/test_custom_partitioner_fn'}, {'test_file': 'dynamo/test_debug_utils'}, {'test_file': 'dynamo/test_base_hop'}, {'test_file': 'dynamo/test_export'}, {'test_file': 'dynamo/test_package'}, {'test_file': 'inductor/test_efficient_conv_bn_eval'}, {'test_file': 'inductor/test_torchbind'}, {'test_file': 'dynamo/test_python_dispatcher'}, {'test_file': 
'export/test_swap'}, {'test_file': 'export/test_unflatten'}, {'test_file': 'dynamo/test_verify_correctness'}, {'test_file': 'inductor/test_fxir_backend'}, {'test_file': 'dynamo/test_cudagraphs_expandable_segments'}, {'test_file': 'inductor/test_caching'}, {'test_file': 'dynamo/test_aot_autograd_cache'}, {'test_file': 'dynamo/test_flat_apply'}, {'test_file': 'dynamo/test_input_attr_tracking'}, {'test_file': 'dynamo/test_graph_deduplication'}, {'test_file': 'inductor/test_distributed_patterns'}, {'test_file': 'dynamo/test_structured_trace'}, {'test_file': 'dynamo/test_error_messages'}, {'test_file': 'dynamo/test_bytecode_utils'}, {'test_file': 'dynamo/test_fake_distributed'}, {'test_file': 'inductor/test_mps_basic'}, {'test_file': 'export/test_nativert'}, {'test_file': 'export/test_hop'}, {'test_file': 'dynamo/test_tree_map'}, {'test_file': 'dynamo/test_minifier'}, {'test_file': 'dynamo/test_guard_manager'}, {'test_file': 'export/test_schema'}, {'test_file': 'dynamo/test_torchrec'}, {'test_file': 'export/test_pass_infra'}, {'test_file': 'dynamo/test_recompile_ux'}, {'test_file': 'export/test_experimental'}, {'test_file': 'export/test_converter'}, {'test_file': 'export/test_export'}, {'test_file': 'test_model_exports_to_core_aten'}, {'test_file': 'dynamo/test_precompile_context'}, {'test_file': 'dynamo/test_trace_rules'}, {'test_file': 'export/test_upgrader'}, {'test_file': 'dynamo/test_hooks'}, {'test_file': 'dynamo/test_reorder_logs'}, {'test_file': 'dynamo/test_subclasses'}, {'test_file': 'dynamo/test_exceptions'}, {'test_file': 'dynamo/test_generator'}, {'test_file': 'export/test_lift_unlift'}, {'test_file': 'export/test_verifier'}, {'test_file': 'export/test_sparse'}, {'test_file': 'dynamo/test_python_autograd'}, {'test_file': 'export/test_draft_export'}, {'test_file': 'dynamo/test_comptime'}, {'test_file': 'test_sort_and_select'}, {'test_file': 'functorch/test_rearrange'}, {'test_file': 'functorch/test_parsing'}, {'test_file': 'test_package'}, {'test_file': 
'profiler/test_profiler'}, {'test_file': 'test_mkl_verbose'}, {'test_file': 'test_comparison_utils'}, {'test_file': 'functorch/test_ac_logging'}, {'test_file': 'test_mkldnn_verbose'}, {'test_file': 'test_cpp_api_parity'}, {'test_file': 'test_utils_config_module'}, {'test_file': 'test_hop_infra'}, {'test_file': 'test_appending_byte_serializer'}, {'test_file': 'test_license'}, {'test_file': 'test_ao_sparsity'}, {'test_file': 'test_autoload'}, {'test_file': 'nn/attention/test_open_registry'}, {'test_file': 'xpu/test_fusion'}, {'test_file': 'test_as_strided'}, {'test_file': 'test_foreach'}, {'test_file': 'test_proxy_tensor'}, {'test_file': 'torch_np/test_binary_ufuncs'}, {'test_file': 'torch_np/test_unary_ufuncs'}, {'test_file': 'test_utils_filelock'}, {'test_file': 'test_extension_utils'}, {'test_file': 'test_rename_privateuse1_to_existing_device'}, {'test_file': 'nn/attention/test_fa4'}, {'test_file': 'typing/test_python_operators'}, {'test_file': 'test_functionalization'}, {'test_file': 'torch_np/test_dtype'}, {'test_file': 'test_file_check'}, {'test_file': 'profiler/test_kineto'}, {'test_file': 'test_flop_counter'}, {'test_file': 'backends/xeon/test_launch'}, {'test_file': 'test_show_pickle'}, {'test_file': 'test_openmp'}, {'test_file': 'test_expanded_weights'}, {'test_file': 'test_module_tracker'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarinherit'}, {'test_file': 'test_tensorexpr_pybind'}, {'test_file': 'test_fx_experimental'}, {'test_file': 'functorch/test_ac_knapsack'}, {'test_file': 'torch_np/test_nep50_examples'}, {'test_file': 'test_torch'}, {'test_file': 'xpu/test_gemm'}, {'test_file': 'test_fx_passes'}, {'test_file': 'functorch/test_logging'}, {'test_file': 'test_namedtensor'}, {'test_file': 'test_tensorexpr'}, {'test_file': 'functorch/test_minifier'}, {'test_file': 'higher_order_ops/test_invoke_quant'}, {'test_file': 'torch_np/test_basic'}, {'test_file': 'test_jiterator'}, {'test_file': 'test_native_functions'}, {'test_file': 'test_typing'}, 
{'test_file': 'higher_order_ops/test_with_effects'}, {'test_file': 'test_weak'}, {'test_file': 'test_complex'}, {'test_file': 'test_optim'}, {'test_file': 'lazy/test_functionalization'}, {'test_file': 'torch_np/test_random'}, {'test_file': 'nn/test_multihead_attention'}, {'test_file': 'test_legacy_vmap'}, {'test_file': 'lazy/test_bindings'}, {'test_file': 'xpu/test_conv'}, {'test_file': 'test_utils'}, {'test_file': 'test_pytree'}, {'test_file': 'test_namedtuple_return_api'}, {'test_file': 'profiler/test_record_function'}, {'test_file': 'test_compile_benchmark_util'}, {'test_file': 'test_set_default_mobile_cpu_allocator'}, {'test_file': 'test_fake_tensor'}, {'test_file': 'test_stateless'}, {'test_file': 'functorch/test_ac'}, {'test_file': 'test_binary_ufuncs'}, {'test_file': 'higher_order_ops/test_print'}, {'test_file': 'test_per_overload_api'}, {'test_file': 'torch_np/numpy_tests/core/test_einsum'}, {'test_file': 'test_multiprocessing'}, {'test_file': 'test_out_dtype_op'}, {'test_file': 'torch_np/test_ufuncs_basic'}, {'test_file': 'lazy/test_step_closures'}, {'test_file': 'functorch/dim/test_getsetitem'}, {'test_file': 'test_fx'}, {'test_file': 'test_numpy_interop'}, {'test_file': 'profiler/test_cpp_thread'}, {'test_file': 'test_hub'}, {'test_file': 'test_segment_reductions'}, {'test_file': 'test_opaque_obj_v2'}, {'test_file': 'test_autograd_fallback'}, {'test_file': 'test_type_hints'}, {'test_file': 'functorch/test_aot_joint_with_descriptors'}, {'test_file': 'test_functionalization_of_rng_ops'}, {'test_file': 'test_fx_reinplace_pass'}, {'test_file': 'functorch/test_control_flow'}, {'test_file': 'test_modules'}, {'test_file': 'nn/test_packed_sequence'}, {'test_file': 'test_numa_binding'}, {'test_file': 'test_pruning_op'}, {'test_file': 'test_jit_fuser_te'}, {'test_file': 'test_autocast'}, {'test_file': 'test_logging'}, {'test_file': 'test_python_dispatch'}, {'test_file': 'nn/test_lazy_modules'}, {'test_file': 'nn/test_pruning'}, {'test_file': 'test_monitor'}, 
{'test_file': 'test_cuda_sanitizer'}, {'test_file': 'test_bundled_inputs'}, {'test_file': 'torch_np/numpy_tests/core/test_numeric'}, {'test_file': 'torch_np/numpy_tests/core/test_multiarray'}, {'test_file': 'test_itt'}, {'test_file': 'torch_np/numpy_tests/lib/test_function_base'}, {'test_file': 'test_masked'}, {'test_file': 'test_sympy_utils'}, {'test_file': 'test_jit_disabled'}, {'test_file': 'test_subclass'}, {'test_file': 'test_import_stats'}, {'test_file': 'functorch/test_vmap_registrations'}, {'test_file': 'nn/test_parametrization'}, {'test_file': 'test_mkldnn_fusion'}, {'test_file': 'test_cpp_extensions_mtia_backend'}, {'test_file': 'lazy/test_ts_opinfo'}, {'test_file': 'test_dynamic_shapes'}, {'test_file': 'complex_tensor/test_complex_tensor'}, {'test_file': 'optim/test_lrscheduler'}, {'test_file': 'optim/test_swa_utils'}, {'test_file': 'cpp_extensions/python_agnostic_extension/test/test_python_agnostic'}, {'test_file': 'functorch/test_memory_efficient_fusion'}, {'test_file': 'torch_np/numpy_tests/lib/test_histograms'}, {'test_file': 'torch_np/test_indexing'}, {'test_file': 'test_schema_check'}, {'test_file': 'test_tensorboard'}, {'test_file': 'torch_np/numpy_tests/core/test_indexing'}, {'test_file': 'test_futures'}, {'test_file': 'test_tensor_creation_ops'}, {'test_file': 'nn/test_dropout'}, {'test_file': 'functorch/dim/test_split'}, {'test_file': 'torch_np/numpy_tests/lib/test_type_check'}, {'test_file': 'cpp_extensions/test_libtorch_agnostic'}, {'test_file': 'test_cpp_extensions_stream_and_event'}, {'test_file': 'profiler/test_execution_trace'}, {'test_file': 'test_jit'}, {'test_file': 'test_dispatch'}, {'test_file': 'test_datapipe'}, {'test_file': 'test_numba_integration'}, {'test_file': 'test_functional_optim'}, {'test_file': 'test_maskedtensor'}, {'test_file': 'benchmark_utils/test_benchmark_utils'}, {'test_file': 'torch_np/numpy_tests/core/test_scalarmath'}, {'test_file': 'test_scaled_matmul_cuda'}, {'test_file': 
'torch_np/numpy_tests/core/test_shape_base'}, {'test_file': 'test_vulkan'}, {'test_file': 'lazy/test_generator'}, {'test_file': 'nn/test_convolution'}, {'test_file': 'torch_np/numpy_tests/linalg/test_linalg'}, {'test_file': 'torch_np/numpy_tests/core/test_dtype'}, {'test_file': 'lazy/test_debug_util'}, {'test_file': 'nn/test_load_state_dict'}, {'test_file': 'test_shape_ops'}, {'test_file': 'nn/test_module_hooks'}, {'test_file': 'torch_np/numpy_tests/lib/test_twodim_base'}, {'test_file': 'profiler/test_memory_profiler'}, {'test_file': 'test_jit_llga_fuser'}, {'test_file': 'test_serialization'}, {'test_file': 'optim/test_optim'}, {'test_file': 'test_indexing'}, {'test_file': 'torch_np/numpy_tests/fft/test_pocketfft'}, {'test_file': 'torch_np/numpy_tests/lib/test_shape_base_'}, {'test_file': 'test_cpp_extensions_jit'}, {'test_file': 'torch_np/numpy_tests/core/test_getlimits'}, {'test_file': 'torch_np/test_ndarray_methods'}, {'test_file': 'test_view_ops'}, {'test_file': 'test_type_info'}, {'test_file': 'functorch/test_aotdispatch'}, {'test_file': 'test_nn'}, {'test_file': 'torch_np/numpy_tests/core/test_dlpack'}, {'test_file': 'test_multiprocessing_spawn'}, {'test_file': 'test_scatter_gather_ops'}, {'test_file': 'test_cuda_multigpu'}, {'test_file': 'test_mkldnn'}, {'test_file': 'torch_np/numpy_tests/lib/test_index_tricks'}, {'test_file': 'test_jit_autocast'}, {'test_file': 'nn/test_pooling'}, {'test_file': 'nn/test_embedding'}, {'test_file': 'test_unary_ufuncs'}, {'test_file': 'test_xnnpack_integration'}, {'test_file': 'test_cuda_trace'}, {'test_file': 'test_native_mha'}, {'test_file': 'torch_np/numpy_tests/core/test_numerictypes'}, {'test_file': 'test_cuda_nvml_based_avail'}, {'test_file': 'test_function_schema'}, {'test_file': 'test_accelerator'}, {'test_file': 'nn/test_init'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_methods'}, {'test_file': 'torch_np/numpy_tests/fft/test_helper'}, {'test_file': 'test_mobile_optimizer'}, {'test_file': 
'torch_np/test_function_base'}, {'test_file': 'test_type_promotion'}, {'test_file': 'torch_np/test_scalars_0D_arrays'}, {'test_file': 'test_cuda_primary_ctx'}, {'test_file': 'profiler/test_profiler_tree'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraysetops'}, {'test_file': 'test_dlpack'}, {'test_file': 'profiler/test_torch_tidy'}, {'test_file': 'lazy/test_reuse_ir'}, {'test_file': 'test_functional_autograd_benchmark'}, {'test_file': 'test_reductions'}, {'test_file': 'torch_np/test_reductions'}, {'test_file': 'torch_np/numpy_tests/core/test_scalar_ctors'}, {'test_file': 'torch_np/numpy_tests/lib/test_arraypad'}, {'test_file': 'test_prims'}, {'test_file': 'test_spectral_ops'}, {'test_file': 'profiler/test_python_tracer'}, {'test_file': 'cpp_extensions/libtorch_agnostic_2_10_extension/test_version_compatibility'}, {'test_file': 'distributions/test_distributions'}, {'test_file': 'test_autoload_disable'}, {'test_file': 'test_autoload_enable'}, {'test_file': 'test_cpp_extensions_aot_ninja'}, {'test_file': 'test_cpp_extensions_aot_no_ninja'}], 'excluded': []} from test/test-reports/td_exclusions-3a043a6734479fe41403.json is not a benchmark record, skipping 2025-12-04T10:27:50.6454681Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6457657Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/Dict_test'}, {'test_file': 'cpp/Dimname_test'}, {'test_file': 'cpp/NamedTensor_test'}, {'test_file': 'cpp/apply_utils_test'}, {'test_file': 'cpp/atest'}, {'test_file': 'cpp/basic'}, {'test_file': 'cpp/broadcast_test'}, {'test_file': 'cpp/cpu_generator_test'}, {'test_file': 'cpp/dlconvertor_test'}, {'test_file': 'cpp/extension_backend_test'}, {'test_file': 'cpp/lazy_tensor_test'}, {'test_file': 'cpp/legacy_vmap_test'}, {'test_file': 'cpp/native_test'}, {'test_file': 'cpp/operators_test'}, 
{'test_file': 'cpp/scalar_tensor_test'}, {'test_file': 'cpp/scalar_test'}, {'test_file': 'cpp/tensor_iterator_test'}, {'test_file': 'cpp/undefined_tensor_test'}, {'test_file': 'cpp/wrapdim_test'}], 'excluded': []} from test/test-reports/td_exclusions-90f443b2f3798464ae25.json is not a benchmark record, skipping 2025-12-04T10:27:50.6460323Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6461473Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_generator_test'}], 'excluded': []} from test/test-reports/td_exclusions-5ff4b9e6317dfa34c9b9.json is not a benchmark record, skipping 2025-12-04T10:27:50.6462605Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6463727Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_half_test'}], 'excluded': []} from test/test-reports/td_exclusions-48af0502552256dc5704.json is not a benchmark record, skipping 2025-12-04T10:27:50.6464900Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6466026Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_vectorized_test'}], 'excluded': []} from test/test-reports/td_exclusions-c17e071ac54ea82196a6.json is not a benchmark record, skipping 2025-12-04T10:27:50.6467139Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6468351Z 
/home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_distributions_test'}], 'excluded': []} from test/test-reports/td_exclusions-31870802a1378b4dbeae.json is not a benchmark record, skipping 2025-12-04T10:27:50.6469486Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6470622Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_optional_test'}], 'excluded': []} from test/test-reports/td_exclusions-30780a9bb566ac6b829d.json is not a benchmark record, skipping 2025-12-04T10:27:50.6471725Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6472824Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_complex_test'}], 'excluded': []} from test/test-reports/td_exclusions-ea5911a2c14300cdb4b7.json is not a benchmark record, skipping 2025-12-04T10:27:50.6473931Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6475058Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_complex_math_test'}], 'excluded': []} from test/test-reports/td_exclusions-86b2d82fa78ec494ed8f.json is not a benchmark record, skipping 2025-12-04T10:27:50.6476262Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6477359Z 
/home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_cub_test'}], 'excluded': []} from test/test-reports/td_exclusions-53dbf991d6bd20536317.json is not a benchmark record, skipping 2025-12-04T10:27:50.6478480Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6479587Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_atomic_ops_test'}], 'excluded': []} from test/test-reports/td_exclusions-4decd56e60a51c28cd5b.json is not a benchmark record, skipping 2025-12-04T10:27:50.6480694Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6481808Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/cuda_allocator_test'}], 'excluded': []} from test/test-reports/td_exclusions-00d12e2557f6e31d34c4.json is not a benchmark record, skipping 2025-12-04T10:27:50.6482918Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6484033Z /home/ec2-user/actions-runner/_work/_actions/pytorch/test-infra/main/.github/actions/upload-benchmark-results/../../scripts/upload_benchmark_results.py:236: UserWarning: {'included': [{'test_file': 'cpp/test_api'}], 'excluded': []} from test/test-reports/td_exclusions-dd384fabedccdc3247b2.json is not a benchmark record, skipping 2025-12-04T10:27:50.6485105Z warn(f"{result} from {filepath} is not a benchmark record, skipping") 2025-12-04T10:27:50.6569287Z ##[group]Run cat test/**/*_toprint.log || true 2025-12-04T10:27:50.6569626Z cat test/**/*_toprint.log || true 
2025-12-04T10:27:50.6577527Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:50.6577797Z env: 2025-12-04T10:27:50.6577959Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:50.6578154Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:50.6578389Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:50.6578809Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:50.6579181Z DEVICE_NAME: 2025-12-04T10:27:50.6579347Z DEVICE_TYPE: 2025-12-04T10:27:50.6579496Z ##[endgroup] 2025-12-04T10:27:50.6738213Z Test results will be stored in test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-09b2f72c46f7df3f.xml 2025-12-04T10:27:50.6738879Z ============================= test session starts ============================== 2025-12-04T10:27:50.6739440Z platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.6.0 -- /opt/conda/envs/py_3.10/bin/python 2025-12-04T10:27:50.6739897Z cachedir: .pytest_cache 2025-12-04T10:27:50.6740437Z hypothesis profile 'pytorch_ci' -> database=None, max_examples=50, derandomize=True, suppress_health_check=[HealthCheck.too_slow] 2025-12-04T10:27:50.6751524Z rootdir: /var/lib/jenkins/workspace 2025-12-04T10:27:50.6751830Z configfile: pytest.ini 2025-12-04T10:27:50.6752454Z plugins: hypothesis-6.56.4, cpp-2.3.0, flakefinder-1.1.0, rerunfailures-14.0, subtests-0.13.1, xdist-3.3.1, xdoctest-1.3.0, typeguard-4.3.0 2025-12-04T10:27:50.6752959Z collecting ... 
collected 2 items 2025-12-04T10:27:50.6753213Z stepcurrent: Cannot find last run test, not skipping 2025-12-04T10:27:50.6753480Z Running 0 items in this shard: 2025-12-04T10:27:50.6753614Z 2025-12-04T10:27:50.6754059Z - generated xml file: /var/lib/jenkins/workspace/test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-09b2f72c46f7df3f.xml - 2025-12-04T10:27:50.6754659Z ============================ no tests ran in 0.01s ============================= 2025-12-04T10:27:50.6847255Z ##[group]Run kill "$MONITOR_SCRIPT_PID" 2025-12-04T10:27:50.6847556Z kill "$MONITOR_SCRIPT_PID" 2025-12-04T10:27:50.6854426Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:50.6854719Z env: 2025-12-04T10:27:50.6854872Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:50.6855061Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:50.6855684Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:50.6856087Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:50.6856441Z DEVICE_NAME: 2025-12-04T10:27:50.6856600Z DEVICE_TYPE: 2025-12-04T10:27:50.6856759Z MONITOR_SCRIPT_PID: 60826 2025-12-04T10:27:50.6856949Z ##[endgroup] 2025-12-04T10:27:50.6883508Z /home/ec2-user/actions-runner/_work/_temp/01ca114f-08aa-49c2-a23a-e00c016adb5c.sh: line 1: kill: (60826) - No such process 2025-12-04T10:27:50.6895293Z ##[error]Process completed with exit code 1. 
2025-12-04T10:27:50.6999338Z Prepare all required actions 2025-12-04T10:27:50.6999697Z Getting action download info 2025-12-04T10:27:50.8461871Z Download action repository 'seemethere/upload-artifact-s3@v5' (SHA:baba72d0712b404f646cebe0730933554ebce96a) 2025-12-04T10:27:51.4376854Z Download action repository 'actions/upload-artifact@v4' (SHA:ea165f8d65b6e75b540449e92b4886f43607fa02) 2025-12-04T10:27:53.5402762Z ##[group]Run ./.github/actions/upload-test-artifacts 2025-12-04T10:27:53.5403134Z with: 2025-12-04T10:27:53.5403445Z file-suffix: test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T10:27:53.5403822Z s3-bucket: gha-artifacts 2025-12-04T10:27:53.5404005Z env: 2025-12-04T10:27:53.5404165Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.5404356Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.5404578Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.5404978Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.5405362Z DEVICE_NAME: 2025-12-04T10:27:53.5405536Z DEVICE_TYPE: 2025-12-04T10:27:53.5405700Z ##[endgroup] 2025-12-04T10:27:53.5460436Z ##[group]Run # Remove any previous test jsons if they exist 2025-12-04T10:27:53.5460767Z # Remove any previous test jsons if they exist 2025-12-04T10:27:53.5461037Z rm -f test-jsons-*.zip 2025-12-04T10:27:53.5461352Z zip -r "test-jsons-${FILE_SUFFIX}.zip" test/test-reports -i '*.json' 2025-12-04T10:27:53.5469342Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:53.5469615Z env: 2025-12-04T10:27:53.5469768Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.5469958Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.5470173Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.5470569Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.5470920Z DEVICE_NAME: 2025-12-04T10:27:53.5471074Z DEVICE_TYPE: 2025-12-04T10:27:53.5471399Z FILE_SUFFIX: 
test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T10:27:53.5471750Z ##[endgroup] 2025-12-04T10:27:53.6281383Z adding: test/test-reports/td_exclusions-e77b946010afe336c823.json (deflated 16%) 2025-12-04T10:27:53.6282147Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-dfb44ef243b54b76.json (stored 0%) 2025-12-04T10:27:53.6282955Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-f86a1ea8b3ea1cce.json (stored 0%) 2025-12-04T10:27:53.7118475Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8842d0c0a55c3e44.json (deflated 98%) 2025-12-04T10:27:53.7119489Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-df8e471be02986ee.json (stored 0%) 2025-12-04T10:27:53.7133635Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-de4e116bf43af918.json (stored 0%) 2025-12-04T10:27:53.7134486Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-2a982e23b7b97d08.json (stored 0%) 2025-12-04T10:27:53.7152461Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-175ab23e93e8bbac.json (deflated 99%) 2025-12-04T10:27:53.7153508Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-c28ba098140a7833.json (stored 0%) 2025-12-04T10:27:53.7154663Z adding: test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-09b2f72c46f7df3f.json (stored 0%) 2025-12-04T10:27:53.7158453Z adding: test/test-reports/python-pytest/test_overrides/test_overrides-d70ad67a6515a66b.json (deflated 99%) 2025-12-04T10:27:53.7161353Z adding: test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-74d740f721c794b6.json (deflated 99%) 2025-12-04T10:27:53.7162400Z adding: 
test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-f972528c27d5475e.json (stored 0%) 2025-12-04T10:27:53.7163229Z adding: test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-f18500af782cc14f.json (stored 0%) 2025-12-04T10:27:53.7163926Z adding: test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-1f54bb39d728015e.json (stored 0%) 2025-12-04T10:27:53.7164647Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-4ce4457e70f486b9.json (stored 0%) 2025-12-04T10:27:53.7165299Z adding: test/test-reports/python-pytest/test_extension_utils/test_extension_utils-2e393af7d1353d9f.json (stored 0%) 2025-12-04T10:27:53.7165924Z adding: test/test-reports/python-pytest/test_show_pickle/test_show_pickle-06dd150985a7f3b0.json (stored 0%) 2025-12-04T10:27:53.7166512Z adding: test/test-reports/python-pytest/test_torch/test_torch-161156eb485440fd.json (deflated 98%) 2025-12-04T10:27:53.7167090Z adding: test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-cc73ec26257e6848.json (stored 0%) 2025-12-04T10:27:53.7169226Z adding: test/test-reports/python-pytest/test_utils/test_utils-dc7ffe8b75564894.json (deflated 99%) 2025-12-04T10:27:53.7169880Z adding: test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-0528ea89b6c462b6.json (stored 0%) 2025-12-04T10:27:53.7170767Z adding: test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-541627ef745602ac.json (deflated 95%) 2025-12-04T10:27:53.7173883Z adding: test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-59f445c48e82dcaa.json (deflated 98%) 2025-12-04T10:27:53.7192248Z adding: test/test-reports/python-pytest/test_fx/test_fx-8e8ec79e212b88b9.json (deflated 99%) 2025-12-04T10:27:53.7193048Z adding: test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-8bc86f9f976d5210.json (stored 0%) 
2025-12-04T10:27:53.7193878Z adding: test/test-reports/python-pytest/test_autocast/test_autocast-260662fd6260d97e.json (stored 0%) 2025-12-04T10:27:53.7194665Z adding: test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-ac7034bd8d91ec1a.json (stored 0%) 2025-12-04T10:27:53.7195538Z adding: test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-38a8accee470b174.json (stored 0%) 2025-12-04T10:27:53.7196597Z adding: test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-c1c0a2e49ca1a379.json (stored 0%) 2025-12-04T10:27:53.7197581Z adding: test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-dd393fbc07d99e9e.json (stored 0%) 2025-12-04T10:27:53.7198387Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-09e3e0f157e06752.json (stored 0%) 2025-12-04T10:27:53.7199155Z adding: test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-cb8aaf0c2b78a127.json (stored 0%) 2025-12-04T10:27:53.7199846Z adding: test/test-reports/python-pytest/test_dispatch/test_dispatch-fae8bf7b5906c582.json (stored 0%) 2025-12-04T10:27:53.7200467Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-4066c5253990dd79.json (deflated 98%) 2025-12-04T10:27:53.7201145Z adding: test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-1d0408224d2abc94.json (stored 0%) 2025-12-04T10:27:53.7201884Z adding: test/test-reports/python-pytest/test_nn/test_nn-7a49688264af9155.json (deflated 99%) 2025-12-04T10:27:53.7202509Z adding: test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-5b6e250b7bbb2ba6.json (stored 0%) 2025-12-04T10:27:53.7203176Z adding: test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-6222189819ddcf1e.json (stored 0%) 2025-12-04T10:27:53.7203762Z adding: 
test/test-reports/python-pytest/test_native_mha/test_native_mha-bac556999acb8bd6.json (stored 0%) 2025-12-04T10:27:53.7204534Z adding: test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-7cf39d7714d1461e.json (deflated 99%) 2025-12-04T10:27:53.7205200Z adding: test/test-reports/python-pytest/test_reductions/test_reductions-8fa1f3b895437bdd.json (stored 0%) 2025-12-04T10:27:53.7205821Z adding: test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-b3af31a20fb8ad2a.json (stored 0%) 2025-12-04T10:27:53.7206574Z adding: test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-2a85008eb39e7213.json (deflated 98%) 2025-12-04T10:27:53.7207416Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-7e3d25a89f42eb08.json (stored 0%) 2025-12-04T10:27:53.7208169Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-b5fa4f992440af7c.json (stored 0%) 2025-12-04T10:27:53.7208991Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-26de2e09af9201ce.json (stored 0%) 2025-12-04T10:27:53.7209841Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-d51a031c86e0e3ba.json (stored 0%) 2025-12-04T10:27:53.7210670Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-6741be3d1dc90f7c.json (stored 0%) 2025-12-04T10:27:53.7211462Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-ee8350aedee45242.json (stored 0%) 2025-12-04T10:27:53.7212241Z adding: test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-29cd727ff584e69e.json (stored 0%) 2025-12-04T10:27:53.7213023Z adding: 
test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-06f3ef79430d8b50.json (stored 0%) 2025-12-04T10:27:53.7213766Z adding: test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-2dedd17aa5d99b38.json (stored 0%) 2025-12-04T10:27:53.7214425Z adding: test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-8729607d2891c907.json (stored 0%) 2025-12-04T10:27:53.7215130Z adding: test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-1f9b47dff76a97ed.json (stored 0%) 2025-12-04T10:27:53.7215835Z adding: test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-de6ee8a8e08eed81.json (stored 0%) 2025-12-04T10:27:53.7216512Z adding: test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-abfa29da9b6f3fb3.json (stored 0%) 2025-12-04T10:27:53.7217185Z adding: test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-bf5934b50d848f7f.json (stored 0%) 2025-12-04T10:27:53.7217880Z adding: test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-d135623a87a3c057.json (stored 0%) 2025-12-04T10:27:53.7218596Z adding: test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-cd0f012a3cd8e7fd.json (stored 0%) 2025-12-04T10:27:53.7219343Z adding: test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-b1819f6ae3648480.json (stored 0%) 2025-12-04T10:27:53.7220124Z adding: test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-59f3a298f57a1d82.json (stored 0%) 2025-12-04T10:27:53.7221084Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9553b7353fc11e83.json (stored 0%) 2025-12-04T10:27:53.7222118Z adding: 
test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9b006bc55755cf8f.json (stored 0%) 2025-12-04T10:27:53.7222955Z adding: test/test-reports/python-pytest/export.test_passes/export.test_passes-6a90abd3a745b76d.json (stored 0%) 2025-12-04T10:27:53.7224184Z adding: test/test-reports/python-pytest/export.test_passes/export.test_passes-3e908af53b230225.json (stored 0%) 2025-12-04T10:27:53.7225417Z adding: test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-00c0455c5def0d5c.json (stored 0%) 2025-12-04T10:27:53.7226534Z adding: test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-4ac3d5673f9a4827.json (stored 0%) 2025-12-04T10:27:53.7227463Z adding: test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-0a4f995bcf28ccb9.json (stored 0%) 2025-12-04T10:27:53.7228176Z adding: test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-5b172dd2c0b9882d.json (stored 0%) 2025-12-04T10:27:53.7228936Z adding: test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-7f6e01a72670401d.json (stored 0%) 2025-12-04T10:27:53.7229753Z adding: test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-57b4d0d1903d84ca.json (stored 0%) 2025-12-04T10:27:53.7230391Z adding: test/test-reports/td_exclusions-3a043a6734479fe41403.json (deflated 82%) 2025-12-04T10:27:53.7230987Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101548.json (deflated 37%) 2025-12-04T10:27:53.7231718Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101709.json (deflated 37%) 2025-12-04T10:27:53.7232313Z adding: test/test-reports/td_exclusions-90f443b2f3798464ae25.json (deflated 73%) 2025-12-04T10:27:53.7232752Z adding: 
test/test-reports/td_exclusions-5ff4b9e6317dfa34c9b9.json (deflated 14%) 2025-12-04T10:27:53.7233185Z adding: test/test-reports/td_exclusions-48af0502552256dc5704.json (deflated 15%) 2025-12-04T10:27:53.7233613Z adding: test/test-reports/td_exclusions-c17e071ac54ea82196a6.json (deflated 14%) 2025-12-04T10:27:53.7234041Z adding: test/test-reports/td_exclusions-31870802a1378b4dbeae.json (deflated 13%) 2025-12-04T10:27:53.7234489Z adding: test/test-reports/td_exclusions-30780a9bb566ac6b829d.json (deflated 14%) 2025-12-04T10:27:53.7234920Z adding: test/test-reports/td_exclusions-ea5911a2c14300cdb4b7.json (deflated 14%) 2025-12-04T10:27:53.7235354Z adding: test/test-reports/td_exclusions-86b2d82fa78ec494ed8f.json (deflated 13%) 2025-12-04T10:27:53.7235784Z adding: test/test-reports/td_exclusions-53dbf991d6bd20536317.json (deflated 15%) 2025-12-04T10:27:53.7236218Z adding: test/test-reports/td_exclusions-4decd56e60a51c28cd5b.json (deflated 14%) 2025-12-04T10:27:53.7236656Z adding: test/test-reports/td_exclusions-00d12e2557f6e31d34c4.json (deflated 14%) 2025-12-04T10:27:53.7237087Z adding: test/test-reports/td_exclusions-dd384fabedccdc3247b2.json (deflated 18%) 2025-12-04T10:27:53.7269874Z ##[group]Run # Remove any previous test reports if they exist 2025-12-04T10:27:53.7270215Z # Remove any previous test reports if they exist 2025-12-04T10:27:53.7270488Z rm -f test-reports-*.zip 2025-12-04T10:27:53.7270818Z zip -r "test-reports-${FILE_SUFFIX}.zip" test/test-reports -i '*.xml' -i '*.csv' 2025-12-04T10:27:53.7278172Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:53.7278442Z env: 2025-12-04T10:27:53.7278598Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.7278791Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.7279018Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.7279517Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.7279874Z DEVICE_NAME: 
2025-12-04T10:27:53.7280027Z DEVICE_TYPE: 2025-12-04T10:27:53.7280324Z FILE_SUFFIX: test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T10:27:53.7280667Z ##[endgroup] 2025-12-04T10:27:53.7393193Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-dfb44ef243b54b76.xml (deflated 28%) 2025-12-04T10:27:53.7394269Z adding: test/test-reports/python-pytest/lazy.test_ts_opinfo/lazy.test_ts_opinfo-f86a1ea8b3ea1cce.xml (deflated 28%) 2025-12-04T10:27:53.8215981Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-8842d0c0a55c3e44.xml (deflated 98%) 2025-12-04T10:27:53.8216994Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-df8e471be02986ee.xml (deflated 28%) 2025-12-04T10:27:53.8218158Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-de4e116bf43af918.xml (deflated 28%) 2025-12-04T10:27:53.8219122Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-2a982e23b7b97d08.xml (deflated 28%) 2025-12-04T10:27:53.8248960Z adding: test/test-reports/python-pytest/inductor.test_flex_attention/inductor.test_flex_attention-175ab23e93e8bbac.xml (deflated 99%) 2025-12-04T10:27:53.8250004Z adding: test/test-reports/python-pytest/test_privateuseone_python_backend/test_privateuseone_python_backend-c28ba098140a7833.xml (deflated 28%) 2025-12-04T10:27:53.8250974Z adding: test/test-reports/python-pytest/test_ci_sanity_check_fail/test_ci_sanity_check_fail-09b2f72c46f7df3f.xml (deflated 28%) 2025-12-04T10:27:53.8253631Z adding: test/test-reports/python-pytest/test_overrides/test_overrides-d70ad67a6515a66b.xml (deflated 99%) 2025-12-04T10:27:53.8256319Z adding: test/test-reports/python-pytest/inductor.test_benchmark_fusion/inductor.test_benchmark_fusion-74d740f721c794b6.xml (deflated 98%) 2025-12-04T10:27:53.8257433Z adding: 
test/test-reports/python-pytest/inductor.test_distributed_patterns/inductor.test_distributed_patterns-f972528c27d5475e.xml (deflated 28%) 2025-12-04T10:27:53.8258280Z adding: test/test-reports/python-pytest/dynamo.test_fake_distributed/dynamo.test_fake_distributed-f18500af782cc14f.xml (deflated 28%) 2025-12-04T10:27:53.8259005Z adding: test/test-reports/python-pytest/test_sort_and_select/test_sort_and_select-1f54bb39d728015e.xml (deflated 28%) 2025-12-04T10:27:53.8259659Z adding: test/test-reports/python-pytest/test_cpp_api_parity/test_cpp_api_parity-4ce4457e70f486b9.xml (deflated 28%) 2025-12-04T10:27:53.8260311Z adding: test/test-reports/python-pytest/test_extension_utils/test_extension_utils-2e393af7d1353d9f.xml (deflated 28%) 2025-12-04T10:27:53.8260946Z adding: test/test-reports/python-pytest/test_show_pickle/test_show_pickle-06dd150985a7f3b0.xml (deflated 28%) 2025-12-04T10:27:53.8261533Z adding: test/test-reports/python-pytest/test_torch/test_torch-161156eb485440fd.xml (deflated 97%) 2025-12-04T10:27:53.8262118Z adding: test/test-reports/python-pytest/test_tensorexpr/test_tensorexpr-cc73ec26257e6848.xml (deflated 28%) 2025-12-04T10:27:53.8263856Z adding: test/test-reports/python-pytest/test_utils/test_utils-dc7ffe8b75564894.xml (deflated 99%) 2025-12-04T10:27:53.8264499Z adding: test/test-reports/python-pytest/test_namedtuple_return_api/test_namedtuple_return_api-0528ea89b6c462b6.xml (deflated 28%) 2025-12-04T10:27:53.8265192Z adding: test/test-reports/python-pytest/test_fake_tensor/test_fake_tensor-541627ef745602ac.xml (deflated 91%) 2025-12-04T10:27:53.8267884Z adding: test/test-reports/python-pytest/test_multiprocessing/test_multiprocessing-59f445c48e82dcaa.xml (deflated 98%) 2025-12-04T10:27:53.8285433Z adding: test/test-reports/python-pytest/test_fx/test_fx-8e8ec79e212b88b9.xml (deflated 99%) 2025-12-04T10:27:53.8286318Z adding: test/test-reports/python-pytest/test_autograd_fallback/test_autograd_fallback-8bc86f9f976d5210.xml (deflated 28%) 
2025-12-04T10:27:53.8287139Z adding: test/test-reports/python-pytest/test_autocast/test_autocast-260662fd6260d97e.xml (deflated 28%) 2025-12-04T10:27:53.8287922Z adding: test/test-reports/python-pytest/test_python_dispatch/test_python_dispatch-ac7034bd8d91ec1a.xml (deflated 28%) 2025-12-04T10:27:53.8288733Z adding: test/test-reports/python-pytest/test_jit_disabled/test_jit_disabled-38a8accee470b174.xml (deflated 27%) 2025-12-04T10:27:53.8289771Z adding: test/test-reports/python-pytest/test_cpp_extensions_mtia_backend/test_cpp_extensions_mtia_backend-c1c0a2e49ca1a379.xml (deflated 28%) 2025-12-04T10:27:53.8290854Z adding: test/test-reports/python-pytest/functorch.test_memory_efficient_fusion/functorch.test_memory_efficient_fusion-dd393fbc07d99e9e.xml (deflated 27%) 2025-12-04T10:27:53.8291845Z adding: test/test-reports/python-pytest/test_tensor_creation_ops/test_tensor_creation_ops-09e3e0f157e06752.xml (deflated 28%) 2025-12-04T10:27:53.8292880Z adding: test/test-reports/python-pytest/test_cpp_extensions_stream_and_event/test_cpp_extensions_stream_and_event-cb8aaf0c2b78a127.xml (deflated 27%) 2025-12-04T10:27:53.8293767Z adding: test/test-reports/python-pytest/test_dispatch/test_dispatch-fae8bf7b5906c582.xml (deflated 28%) 2025-12-04T10:27:53.8294538Z adding: test/test-reports/python-pytest/nn.test_convolution/nn.test_convolution-4066c5253990dd79.xml (deflated 98%) 2025-12-04T10:27:53.8295364Z adding: test/test-reports/python-pytest/test_cpp_extensions_jit/test_cpp_extensions_jit-1d0408224d2abc94.xml (deflated 28%) 2025-12-04T10:27:53.8296038Z adding: test/test-reports/python-pytest/test_nn/test_nn-7a49688264af9155.xml (deflated 99%) 2025-12-04T10:27:53.8296664Z adding: test/test-reports/python-pytest/test_multiprocessing_spawn/test_multiprocessing_spawn-5b6e250b7bbb2ba6.xml (deflated 28%) 2025-12-04T10:27:53.8297332Z adding: test/test-reports/python-pytest/nn.test_pooling/nn.test_pooling-6222189819ddcf1e.xml (deflated 27%) 2025-12-04T10:27:53.8297920Z adding: 
test/test-reports/python-pytest/test_native_mha/test_native_mha-bac556999acb8bd6.xml (deflated 28%) 2025-12-04T10:27:53.8298559Z adding: test/test-reports/python-pytest/test_mobile_optimizer/test_mobile_optimizer-7cf39d7714d1461e.xml (deflated 99%) 2025-12-04T10:27:53.8299195Z adding: test/test-reports/python-pytest/test_reductions/test_reductions-8fa1f3b895437bdd.xml (deflated 28%) 2025-12-04T10:27:53.8299811Z adding: test/test-reports/python-pytest/test_spectral_ops/test_spectral_ops-b3af31a20fb8ad2a.xml (deflated 28%) 2025-12-04T10:27:53.8300552Z adding: test/test-reports/python-pytest/distributions.test_distributions/distributions.test_distributions-2a85008eb39e7213.xml (deflated 98%) 2025-12-04T10:27:53.8301351Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_ninja/test_cpp_extensions_aot_ninja-7e3d25a89f42eb08.xml (deflated 28%) 2025-12-04T10:27:53.8302245Z adding: test/test-reports/python-pytest/test_cpp_extensions_aot_no_ninja/test_cpp_extensions_aot_no_ninja-b5fa4f992440af7c.xml (deflated 28%) 2025-12-04T10:27:53.8303066Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-26de2e09af9201ce.xml (deflated 28%) 2025-12-04T10:27:53.8303921Z adding: test/test-reports/python-pytest/inductor.test_collective_autotuning/inductor.test_collective_autotuning-d51a031c86e0e3ba.xml (deflated 28%) 2025-12-04T10:27:53.8304744Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-6741be3d1dc90f7c.xml (deflated 28%) 2025-12-04T10:27:53.8305549Z adding: test/test-reports/python-pytest/inductor.test_aot_inductor_utils/inductor.test_aot_inductor_utils-ee8350aedee45242.xml (deflated 28%) 2025-12-04T10:27:53.8306364Z adding: test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-29cd727ff584e69e.xml (deflated 28%) 2025-12-04T10:27:53.8307212Z adding: 
test/test-reports/python-pytest/dynamo.test_graph_region_tracker/dynamo.test_graph_region_tracker-06f3ef79430d8b50.xml (deflated 28%) 2025-12-04T10:27:53.8308049Z adding: test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-2dedd17aa5d99b38.xml (deflated 28%) 2025-12-04T10:27:53.8308713Z adding: test/test-reports/python-pytest/dynamo.test_unittest/dynamo.test_unittest-8729607d2891c907.xml (deflated 28%) 2025-12-04T10:27:53.8309479Z adding: test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-1f9b47dff76a97ed.xml (deflated 28%) 2025-12-04T10:27:53.8310185Z adding: test/test-reports/python-pytest/inductor.test_compile/inductor.test_compile-de6ee8a8e08eed81.xml (deflated 28%) 2025-12-04T10:27:53.8310874Z adding: test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-abfa29da9b6f3fb3.xml (deflated 28%) 2025-12-04T10:27:53.8311549Z adding: test/test-reports/python-pytest/dynamo.test_functions/dynamo.test_functions-bf5934b50d848f7f.xml (deflated 27%) 2025-12-04T10:27:53.8312284Z adding: test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-d135623a87a3c057.xml (deflated 28%) 2025-12-04T10:27:53.8313013Z adding: test/test-reports/python-pytest/inductor.test_ordered_set/inductor.test_ordered_set-cd0f012a3cd8e7fd.xml (deflated 28%) 2025-12-04T10:27:53.8313774Z adding: test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-b1819f6ae3648480.xml (deflated 28%) 2025-12-04T10:27:53.8314744Z adding: test/test-reports/python-pytest/dynamo.test_install_free_tensors/dynamo.test_install_free_tensors-59f3a298f57a1d82.xml (deflated 28%) 2025-12-04T10:27:53.8315685Z adding: test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9553b7353fc11e83.xml (deflated 28%) 2025-12-04T10:27:53.8316760Z adding: 
test/test-reports/python-pytest/inductor.test_torchinductor_codegen_config_overrides/inductor.test_torchinductor_codegen_config_overrides-9b006bc55755cf8f.xml (deflated 28%) 2025-12-04T10:27:53.8317615Z adding: test/test-reports/python-pytest/export.test_passes/export.test_passes-6a90abd3a745b76d.xml (deflated 28%) 2025-12-04T10:27:53.8318253Z adding: test/test-reports/python-pytest/export.test_passes/export.test_passes-3e908af53b230225.xml (deflated 28%) 2025-12-04T10:27:53.8318971Z adding: test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-00c0455c5def0d5c.xml (deflated 28%) 2025-12-04T10:27:53.8319754Z adding: test/test-reports/python-pytest/dynamo.test_autograd_function/dynamo.test_autograd_function-4ac3d5673f9a4827.xml (deflated 28%) 2025-12-04T10:27:53.8320503Z adding: test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-0a4f995bcf28ccb9.xml (deflated 28%) 2025-12-04T10:27:53.8321223Z adding: test/test-reports/python-pytest/inductor.test_codecache/inductor.test_codecache-5b172dd2c0b9882d.xml (deflated 28%) 2025-12-04T10:27:53.8321996Z adding: test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-7f6e01a72670401d.xml (deflated 28%) 2025-12-04T10:27:53.8322818Z adding: test/test-reports/python-pytest/complex_tensor.test_complex_tensor/complex_tensor.test_complex_tensor-57b4d0d1903d84ca.xml (deflated 28%) 2025-12-04T10:27:53.8323601Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101548.xml (deflated 43%) 2025-12-04T10:27:53.8324324Z adding: test/test-reports/python-unittest/test_autoload/TEST-TestDeviceBackendAutoload-20251204101709.xml (deflated 43%) 2025-12-04T10:27:53.8369035Z ##[group]Run # Remove any previous usage logs if they exist 2025-12-04T10:27:53.8369355Z # Remove any previous usage logs if they exist 2025-12-04T10:27:53.8369614Z rm -f logs-*.zip 2025-12-04T10:27:53.8369868Z zip 
"logs-${FILE_SUFFIX}.zip" 'usage_log.txt' || true 2025-12-04T10:27:53.8370227Z zip -r "logs-${FILE_SUFFIX}.zip" test/test-reports -i '*.log' || true 2025-12-04T10:27:53.8377372Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:53.8377661Z env: 2025-12-04T10:27:53.8377815Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.8377995Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.8378216Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.8378611Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.8378967Z DEVICE_NAME: 2025-12-04T10:27:53.8379287Z DEVICE_TYPE: 2025-12-04T10:27:53.8379612Z FILE_SUFFIX: test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T10:27:53.8379968Z ##[endgroup] 2025-12-04T10:27:53.8449455Z adding: usage_log.txt (deflated 58%) 2025-12-04T10:27:53.8484213Z adding: test/test-reports/lazy.test_ts_opinfo_1.1_4f78b575fb718f5e_.log (deflated 49%) 2025-12-04T10:27:53.8484823Z adding: test/test-reports/lazy.test_ts_opinfo_1.1_b1841d9006e1882f_.log (deflated 49%) 2025-12-04T10:27:53.9308845Z adding: test/test-reports/inductor.test_flex_attention_1.6_ddac0a72250f3643_.log (deflated 98%) 2025-12-04T10:27:53.9309550Z adding: test/test-reports/inductor.test_flex_attention_3.6_66a4e481ecf1862e_.log (deflated 50%) 2025-12-04T10:27:53.9310202Z adding: test/test-reports/inductor.test_flex_attention_4.6_e5d890032a85dd23_.log (deflated 50%) 2025-12-04T10:27:53.9310847Z adding: test/test-reports/inductor.test_flex_attention_5.6_2f9a5215a30f13bf_.log (deflated 50%) 2025-12-04T10:27:53.9312563Z adding: test/test-reports/inductor.test_flex_attention_6.6_5a3a1f34f66362bd_.log (deflated 96%) 2025-12-04T10:27:53.9313249Z adding: test/test-reports/test_privateuseone_python_backend_1.1_e6e06e88a3ef7cfe_.log (deflated 51%) 2025-12-04T10:27:53.9313907Z adding: test/test-reports/test_ci_sanity_check_fail_1.1_p62r_fi1_toprint.log (deflated 50%) 
2025-12-04T10:27:53.9316871Z adding: test/test-reports/test_overrides_1.1_02cd6dd5329f6857_.log (deflated 98%) 2025-12-04T10:27:53.9317369Z adding: test/test-reports/inductor.test_max_autotune_1.1_6e5671ef4b4366ba_.log (deflated 34%) 2025-12-04T10:27:53.9317892Z adding: test/test-reports/inductor.test_cutlass_backend_1.1_010b9ec4a3497121_.log (deflated 33%) 2025-12-04T10:27:53.9319909Z adding: test/test-reports/inductor.test_benchmark_fusion_1.1_785876950c2bc41a_.log (deflated 97%) 2025-12-04T10:27:53.9320485Z adding: test/test-reports/inductor.test_distributed_patterns_1.1_58512cfe279ad5e4_.log (deflated 51%) 2025-12-04T10:27:53.9321038Z adding: test/test-reports/dynamo.test_fake_distributed_1.1_a5e8149ee594be6f_.log (deflated 50%) 2025-12-04T10:27:53.9321530Z adding: test/test-reports/test_sort_and_select_1.1_2c2b3dd622ee4cd1_.log (deflated 49%) 2025-12-04T10:27:53.9321993Z adding: test/test-reports/test_cpp_api_parity_1.1_fb498b7352b18758_.log (deflated 49%) 2025-12-04T10:27:53.9322457Z adding: test/test-reports/test_extension_utils_1.1_f7778f929dca92b8_.log (deflated 49%) 2025-12-04T10:27:53.9322908Z adding: test/test-reports/test_show_pickle_1.1_d14d40ffea3c45e9_.log (deflated 48%) 2025-12-04T10:27:53.9325359Z adding: test/test-reports/test_torch_1.1_c5508ce831427b28_.log (deflated 95%) 2025-12-04T10:27:53.9325795Z adding: test/test-reports/test_tensorexpr_1.1_382c0bca4aee7904_.log (deflated 48%) 2025-12-04T10:27:53.9327492Z adding: test/test-reports/test_utils_1.1_17124a5ce703c95e_.log (deflated 96%) 2025-12-04T10:27:53.9327957Z adding: test/test-reports/test_namedtuple_return_api_1.1_106189467b589eb1_.log (deflated 50%) 2025-12-04T10:27:53.9329159Z adding: test/test-reports/test_fake_tensor_1.1_e3cb41e76a7ffef1_.log (deflated 90%) 2025-12-04T10:27:53.9330916Z adding: test/test-reports/test_multiprocessing_1.1_c396cb0e4a333e9f_.log (deflated 95%) 2025-12-04T10:27:53.9332231Z adding: test/test-reports/test_fx_1.1_56e3136b301d1666_.log (deflated 94%) 
2025-12-04T10:27:53.9332668Z adding: test/test-reports/test_autograd_fallback_1.1_54a485e9b165ff35_.log (deflated 49%) 2025-12-04T10:27:53.9333474Z adding: test/test-reports/test_autocast_1.1_8067b1042af94705_.log (deflated 48%) 2025-12-04T10:27:53.9333923Z adding: test/test-reports/test_python_dispatch_1.1_65235bc5900c6671_.log (deflated 49%) 2025-12-04T10:27:53.9334382Z adding: test/test-reports/test_jit_disabled_1.1_9acfb14806c2339e_.log (deflated 49%) 2025-12-04T10:27:53.9334875Z adding: test/test-reports/test_cpp_extensions_mtia_backend_1.1_47e5defbe240dd5e_.log (deflated 51%) 2025-12-04T10:27:53.9335561Z adding: test/test-reports/functorch.test_memory_efficient_fusion_1.1_3d166d53ca5578b9_.log (deflated 51%) 2025-12-04T10:27:53.9336127Z adding: test/test-reports/test_tensor_creation_ops_1.1_7a72945e9c8beebc_.log (deflated 50%) 2025-12-04T10:27:53.9336663Z adding: test/test-reports/test_cpp_extensions_stream_and_event_1.1_2d77e458babddace_.log (deflated 51%) 2025-12-04T10:27:53.9337163Z adding: test/test-reports/test_dispatch_1.1_fd773829162dcd6a_.log (deflated 48%) 2025-12-04T10:27:53.9337810Z adding: test/test-reports/nn.test_convolution_1.1_2ecf4aa97a43dda6_.log (deflated 95%) 2025-12-04T10:27:53.9338364Z adding: test/test-reports/test_cpp_extensions_jit_1.1_1167a044bc330c16_.log (deflated 50%) 2025-12-04T10:27:53.9340655Z adding: test/test-reports/test_nn_1.1_0bfb94cdb04087aa_.log (deflated 97%) 2025-12-04T10:27:53.9341124Z adding: test/test-reports/test_multiprocessing_spawn_1.1_9fb1f4fec5b6e0d2_.log (deflated 50%) 2025-12-04T10:27:53.9341592Z adding: test/test-reports/nn.test_pooling_1.1_02768dc568b09226_.log (deflated 48%) 2025-12-04T10:27:53.9342029Z adding: test/test-reports/test_cuda_trace_1.1_d047b12230bdbed1_.log (stored 0%) 2025-12-04T10:27:53.9342462Z adding: test/test-reports/test_native_mha_1.1_1ef09fc6539df3bd_.log (deflated 48%) 2025-12-04T10:27:53.9342916Z adding: test/test-reports/test_cuda_nvml_based_avail_1.1_cc197c973db74fb9_.log (stored 
0%) 2025-12-04T10:27:53.9344624Z adding: test/test-reports/test_mobile_optimizer_1.1_4839ede4d61f3b89_.log (deflated 97%) 2025-12-04T10:27:53.9345110Z adding: test/test-reports/test_cuda_primary_ctx_1.1_f78fb2d6a682ee44_.log (stored 0%) 2025-12-04T10:27:53.9345562Z adding: test/test-reports/test_reductions_1.1_474a3edd9482342d_.log (deflated 48%) 2025-12-04T10:27:53.9346002Z adding: test/test-reports/test_spectral_ops_1.1_68b862ae55c7c6af_.log (deflated 49%) 2025-12-04T10:27:53.9347162Z adding: test/test-reports/distributions.test_distributions_1.1_ced7167d7dfd0dab_.log (deflated 95%) 2025-12-04T10:27:53.9347942Z adding: test/test-reports/test_cpp_extensions_aot_ninja_1.1_26fdd18b2d333d72_.log (deflated 50%) 2025-12-04T10:27:53.9348599Z adding: test/test-reports/test_cpp_extensions_aot_no_ninja_1.1_066b94fa818468e5_.log (deflated 51%) 2025-12-04T10:27:53.9349275Z adding: test/test-reports/inductor.test_collective_autotuning_1.1_550d9541b7e790e0_.log (deflated 52%) 2025-12-04T10:27:53.9349909Z adding: test/test-reports/inductor.test_halide_1.1_1cd6f628b6d78e53_.log (deflated 7%) 2025-12-04T10:27:53.9350517Z adding: test/test-reports/inductor.test_aot_inductor_utils_1.1_64c4865198c449ee_.log (deflated 51%) 2025-12-04T10:27:53.9351186Z adding: test/test-reports/dynamo.test_graph_region_tracker_1.1_9fc1ac46ca6a3092_.log (deflated 51%) 2025-12-04T10:27:53.9351809Z adding: test/test-reports/dynamo.test_unittest_1.1_5b6ad9c3fddc9671_.log (deflated 50%) 2025-12-04T10:27:53.9352398Z adding: test/test-reports/inductor.test_compile_1.1_4a12187e152c59f0_.log (deflated 50%) 2025-12-04T10:27:53.9352978Z adding: test/test-reports/dynamo.test_functions_1.1_330f01649095c7d8_.log (deflated 50%) 2025-12-04T10:27:53.9353583Z adding: test/test-reports/inductor.test_ordered_set_1.1_0d7f7a7fdedccd6f_.log (deflated 50%) 2025-12-04T10:27:53.9354250Z adding: test/test-reports/dynamo.test_install_free_tensors_1.1_afcc566d5ba882b3_.log (deflated 51%) 2025-12-04T10:27:53.9355001Z adding: 
test/test-reports/inductor.test_torchinductor_codegen_config_overrides_1.1_7ffeab4b4f5448ff_.log (deflated 54%) 2025-12-04T10:27:53.9355957Z adding: test/test-reports/export.test_passes_1.1_9a890949cbdad883_.log (deflated 49%) 2025-12-04T10:27:53.9356544Z adding: test/test-reports/dynamo.test_autograd_function_1.1_bef268133c355af5_.log (deflated 51%) 2025-12-04T10:27:53.9357068Z adding: test/test-reports/inductor.test_codecache_1.1_fbe410a98ef19d73_.log (deflated 53%) 2025-12-04T10:27:53.9357597Z adding: test/test-reports/complex_tensor.test_complex_tensor_2.3_7c46d523192cf8e5_.log (deflated 58%) 2025-12-04T10:27:53.9358115Z adding: test/test-reports/optim.test_lrscheduler_1.1_33ec11e4104f54ed_.log (deflated 7%) 2025-12-04T10:27:53.9358737Z adding: test/test-reports/inductor.test_collective_autotuning_1.1_1c394bc2a0574cb0_.log (deflated 52%) 2025-12-04T10:27:53.9359250Z adding: test/test-reports/inductor.test_halide_1.1_58d4617b51948353_.log (deflated 6%) 2025-12-04T10:27:53.9359760Z adding: test/test-reports/inductor.test_aot_inductor_utils_1.1_4806681524bae620_.log (deflated 51%) 2025-12-04T10:27:53.9360291Z adding: test/test-reports/dynamo.test_graph_region_tracker_1.1_9d09536ee8be792f_.log (deflated 51%) 2025-12-04T10:27:53.9360849Z adding: test/test-reports/dynamo.test_unittest_1.1_c9f1cab89a7c66a5_.log (deflated 49%) 2025-12-04T10:27:53.9361323Z adding: test/test-reports/inductor.test_compile_1.1_a8a66c5b22feb377_.log (deflated 50%) 2025-12-04T10:27:53.9361794Z adding: test/test-reports/dynamo.test_functions_1.1_f1b0f3ce8ba833d3_.log (deflated 49%) 2025-12-04T10:27:53.9362274Z adding: test/test-reports/inductor.test_ordered_set_1.1_49f52951685697ab_.log (deflated 50%) 2025-12-04T10:27:53.9362782Z adding: test/test-reports/dynamo.test_install_free_tensors_1.1_7d2bd3620c394d13_.log (deflated 51%) 2025-12-04T10:27:53.9363269Z adding: test/test-reports/export.test_passes_1.1_06978b298f6392da_.log (deflated 49%) 2025-12-04T10:27:53.9363826Z adding: 
test/test-reports/inductor.test_torchinductor_codegen_config_overrides_1.1_d73cc8b4d15ded76_.log (deflated 53%) 2025-12-04T10:27:53.9364425Z adding: test/test-reports/dynamo.test_autograd_function_1.1_d50170967ccb6cc8_.log (deflated 50%) 2025-12-04T10:27:53.9364961Z adding: test/test-reports/complex_tensor.test_complex_tensor_2.3_8c97df55eaaa8b55_.log (deflated 58%) 2025-12-04T10:27:53.9365478Z adding: test/test-reports/inductor.test_codecache_1.1_83e4acb8e97ccfe4_.log (deflated 52%) 2025-12-04T10:27:53.9365974Z adding: test/test-reports/optim.test_lrscheduler_1.1_5edfc3f4cf508994_.log (deflated 7%) 2025-12-04T10:27:53.9388099Z ##[group]Run # Remove any previous debugging artifacts if they exist 2025-12-04T10:27:53.9388587Z # Remove any previous debugging artifacts if they exist 2025-12-04T10:27:53.9388876Z rm -f debug-*.zip 2025-12-04T10:27:53.9389074Z if [ -d 'test/debug' ]; then 2025-12-04T10:27:53.9389333Z  zip -r "debug-${FILE_SUFFIX}.zip" test/debug 2025-12-04T10:27:53.9389576Z fi 2025-12-04T10:27:53.9396479Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:53.9396751Z env: 2025-12-04T10:27:53.9396913Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.9397111Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.9397339Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.9397737Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.9398094Z DEVICE_NAME: 2025-12-04T10:27:53.9398250Z DEVICE_TYPE: 2025-12-04T10:27:53.9398548Z FILE_SUFFIX: test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563 2025-12-04T10:27:53.9398902Z ##[endgroup] 2025-12-04T10:27:53.9516709Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T10:27:53.9516959Z with: 2025-12-04T10:27:53.9517118Z s3-bucket: gha-artifacts 2025-12-04T10:27:53.9517349Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:53.9517607Z retention-days: 14 2025-12-04T10:27:53.9517799Z 
if-no-files-found: warn 2025-12-04T10:27:53.9517994Z path: test-jsons-*.zip 2025-12-04T10:27:53.9518166Z name: artifact 2025-12-04T10:27:53.9518322Z region: us-east-1 2025-12-04T10:27:53.9518570Z env: 2025-12-04T10:27:53.9518712Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:53.9518897Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:53.9519119Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:53.9519510Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:53.9519879Z DEVICE_NAME: 2025-12-04T10:27:53.9520034Z DEVICE_TYPE: 2025-12-04T10:27:53.9520181Z ##[endgroup] 2025-12-04T10:27:54.4645622Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T10:27:54.4646460Z With the provided path, there will be 1 file uploaded 2025-12-04T10:27:54.4646892Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:54.4714936Z Starting upload of test-jsons-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 2025-12-04T10:27:54.6554821Z Finished upload of test-jsons-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 2025-12-04T10:27:54.6773949Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T10:27:54.6774293Z with: 2025-12-04T10:27:54.6774453Z s3-bucket: gha-artifacts 2025-12-04T10:27:54.6774693Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:54.6774932Z retention-days: 14 2025-12-04T10:27:54.6775114Z if-no-files-found: error 2025-12-04T10:27:54.6775306Z path: test-reports-*.zip 2025-12-04T10:27:54.6775480Z name: artifact 2025-12-04T10:27:54.6775643Z region: us-east-1 2025-12-04T10:27:54.6775803Z env: 2025-12-04T10:27:54.6775962Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:54.6776166Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:54.6776395Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:54.6776796Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 
2025-12-04T10:27:54.6777147Z DEVICE_NAME: 2025-12-04T10:27:54.6777311Z DEVICE_TYPE: 2025-12-04T10:27:54.6777470Z ##[endgroup] 2025-12-04T10:27:55.2500297Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T10:27:55.2500774Z With the provided path, there will be 1 file uploaded 2025-12-04T10:27:55.2501178Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:55.2568870Z Starting upload of test-reports-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 2025-12-04T10:27:55.4040669Z Finished upload of test-reports-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 2025-12-04T10:27:55.4303730Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T10:27:55.4303973Z with: 2025-12-04T10:27:55.4304161Z s3-bucket: gha-artifacts 2025-12-04T10:27:55.4304405Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:55.4304651Z retention-days: 14 2025-12-04T10:27:55.4304828Z if-no-files-found: ignore 2025-12-04T10:27:55.4305020Z path: logs-*.zip 2025-12-04T10:27:55.4305181Z name: artifact 2025-12-04T10:27:55.4305343Z region: us-east-1 2025-12-04T10:27:55.4305502Z env: 2025-12-04T10:27:55.4305651Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:55.4305845Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:55.4306078Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:55.4306466Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:55.4306824Z DEVICE_NAME: 2025-12-04T10:27:55.4306981Z DEVICE_TYPE: 2025-12-04T10:27:55.4307131Z ##[endgroup] 2025-12-04T10:27:55.7221851Z NOTE: s3-prefix specified, ignoring name parameter 2025-12-04T10:27:55.7222290Z With the provided path, there will be 1 file uploaded 2025-12-04T10:27:55.7222727Z Uploading to s3 prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:55.7290527Z Starting upload of logs-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 
2025-12-04T10:27:55.8706090Z Finished upload of logs-test-default-1-7-linux.g6.4xlarge.experimental.nvidia.gpu_57120265563.zip 2025-12-04T10:27:55.8948109Z ##[group]Run seemethere/upload-artifact-s3@v5 2025-12-04T10:27:55.8948473Z with: 2025-12-04T10:27:55.8948644Z s3-bucket: gha-artifacts 2025-12-04T10:27:55.8948894Z s3-prefix: pytorch/pytorch/19922826259/1/artifact 2025-12-04T10:27:55.8949137Z retention-days: 14 2025-12-04T10:27:55.8949314Z if-no-files-found: ignore 2025-12-04T10:27:55.8949503Z path: debug-*.zip 2025-12-04T10:27:55.8949661Z name: artifact 2025-12-04T10:27:55.8949831Z region: us-east-1 2025-12-04T10:27:55.8949993Z env: 2025-12-04T10:27:55.8950137Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:55.8950319Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:55.8950706Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:55.8951125Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:55.8951489Z DEVICE_NAME: 2025-12-04T10:27:55.8951648Z DEVICE_TYPE: 2025-12-04T10:27:55.8951803Z ##[endgroup] 2025-12-04T10:27:56.1833317Z No files were found with the provided path: debug-*.zip. No artifacts will be uploaded. 2025-12-04T10:27:56.2087666Z ##[group]Run # shellcheck disable=SC2156 2025-12-04T10:27:56.2088087Z # shellcheck disable=SC2156 2025-12-04T10:27:56.2088506Z find . 
-iname "core.[1-9]*" -exec docker exec "${DOCKER_CONTAINER_ID}" sh -c "gdb python {} -ex 'bt' -ex 'q'" \; 2025-12-04T10:27:56.2096500Z shell: /usr/bin/bash -e {0} 2025-12-04T10:27:56.2096721Z env: 2025-12-04T10:27:56.2096882Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:56.2097068Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:56.2097297Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:56.2097709Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:56.2098080Z DEVICE_NAME: 2025-12-04T10:27:56.2098235Z DEVICE_TYPE: 2025-12-04T10:27:56.2098390Z ##[endgroup] 2025-12-04T10:27:56.6416928Z Prepare all required actions 2025-12-04T10:27:56.6417305Z Getting action download info 2025-12-04T10:27:56.8231863Z Download action repository 'actions/setup-python@v6' (SHA:83679a892e2d95755f2dac6acb0bfd1e9ac5d548) 2025-12-04T10:27:58.3119692Z ##[group]Run ./.github/actions/upload-utilization-stats 2025-12-04T10:27:58.3119974Z with: 2025-12-04T10:27:58.3120131Z job_id: 57120265563 2025-12-04T10:27:58.3120638Z job_name: linux-jammy-cuda12.8-py3.10-gcc11-debug / test (default, 1, 7, linux.g6.4xlarge.experimental.nvidia.gpu, oncall:debug-build, rerun_disabled... 
2025-12-04T10:27:58.3121179Z workflow_name: periodic 2025-12-04T10:27:58.3121371Z workflow_run_id: 19922826259 2025-12-04T10:27:58.3121570Z workflow_attempt: 1 2025-12-04T10:27:58.3121736Z env: 2025-12-04T10:27:58.3121889Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:58.3122080Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:58.3122306Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:58.3122749Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:58.3123120Z DEVICE_NAME: 2025-12-04T10:27:58.3123276Z DEVICE_TYPE: 2025-12-04T10:27:58.3123435Z ##[endgroup] 2025-12-04T10:27:58.3230091Z ##[group]Run actions/setup-python@v6 2025-12-04T10:27:58.3230316Z with: 2025-12-04T10:27:58.3230474Z python-version: 3.10 2025-12-04T10:27:58.3230651Z check-latest: false 2025-12-04T10:27:58.3230930Z token: *** 2025-12-04T10:27:58.3231103Z update-environment: true 2025-12-04T10:27:58.3231318Z allow-prereleases: false 2025-12-04T10:27:58.3231496Z freethreaded: false 2025-12-04T10:27:58.3231660Z env: 2025-12-04T10:27:58.3231805Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:58.3231979Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:58.3232212Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:58.3232613Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:58.3232971Z DEVICE_NAME: 2025-12-04T10:27:58.3233145Z DEVICE_TYPE: 2025-12-04T10:27:58.3233304Z ##[endgroup] 2025-12-04T10:27:58.6400736Z ##[group]Installed versions 2025-12-04T10:27:58.6408746Z Version 3.10 was not found in the local cache 2025-12-04T10:27:58.6552458Z (node:101645) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T10:27:58.6553182Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T10:27:58.9507889Z ##[error]The version '3.10' with architecture 'x64' was not found for this operating system. 
The list of all available versions can be found here: https://raw.githubusercontent.com/actions/python-versions/main/versions-manifest.json 2025-12-04T10:27:58.9716117Z ##[group]Run pytorch/test-infra/.github/actions/teardown-linux@main 2025-12-04T10:27:58.9716459Z with: 2025-12-04T10:27:58.9716612Z env: 2025-12-04T10:27:58.9716767Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:58.9716954Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:58.9717184Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:58.9717606Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:58.9717977Z DEVICE_NAME: 2025-12-04T10:27:58.9718260Z DEVICE_TYPE: 2025-12-04T10:27:58.9718417Z ##[endgroup] 2025-12-04T10:27:58.9804754Z ##[group]Run set -eou pipefail 2025-12-04T10:27:58.9804988Z set -eou pipefail 2025-12-04T10:27:58.9805181Z  2025-12-04T10:27:58.9805443Z echo "Holding runner for 2 hours until all ssh sessions have logged out" 2025-12-04T10:27:58.9805772Z for _ in $(seq 1440); do 2025-12-04T10:27:58.9806037Z  # Break if no ssh session exists anymore 2025-12-04T10:27:58.9806275Z  if [ "$(who)" = "" ]; then 2025-12-04T10:27:58.9806480Z  break 2025-12-04T10:27:58.9806644Z  fi 2025-12-04T10:27:58.9806803Z  echo "." 
2025-12-04T10:27:58.9806973Z  sleep 5 2025-12-04T10:27:58.9807133Z done 2025-12-04T10:27:58.9815057Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:58.9815323Z env: 2025-12-04T10:27:58.9815489Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:58.9815684Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:58.9815914Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:58.9816317Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:58.9816677Z DEVICE_NAME: 2025-12-04T10:27:58.9816832Z DEVICE_TYPE: 2025-12-04T10:27:58.9816987Z ##[endgroup] 2025-12-04T10:27:58.9845734Z Holding runner for 2 hours until all ssh sessions have logged out 2025-12-04T10:27:59.0354771Z ##[group]Run # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T10:27:59.0355177Z # ignore expansion of "docker ps -q" since it could be empty 2025-12-04T10:27:59.0355837Z # shellcheck disable=SC2046 2025-12-04T10:27:59.0356097Z docker stop $(docker ps -q) || true 2025-12-04T10:27:59.0356348Z # Prune all of the docker images 2025-12-04T10:27:59.0356577Z docker system prune -af 2025-12-04T10:27:59.0363686Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:27:59.0363976Z env: 2025-12-04T10:27:59.0364137Z GIT_DEFAULT_BRANCH: main 2025-12-04T10:27:59.0364334Z HAS_NVIDIA_GPU: true 2025-12-04T10:27:59.0364578Z GPU_FLAG: --gpus all -e NVIDIA_DRIVER_CAPABILITIES=all 2025-12-04T10:27:59.0364981Z DOCKER_CONTAINER_ID: 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:27:59.0365344Z DEVICE_NAME: 2025-12-04T10:27:59.0365503Z DEVICE_TYPE: 2025-12-04T10:27:59.0365663Z ##[endgroup] 2025-12-04T10:28:21.2086598Z 7dec456c8d4c 2025-12-04T10:28:21.9238069Z Deleted Containers: 2025-12-04T10:28:21.9238485Z 7dec456c8d4cb134c5c70ed4f7d52a1ce0548913d2d4b55e9daab9ad0d1fcf70 2025-12-04T10:28:21.9238978Z 2025-12-04T10:28:33.7379772Z Deleted Images: 2025-12-04T10:28:33.7380674Z untagged: 
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image:pytorch-linux-jammy-cuda12.8-cudnn9-py3-gcc11-f0cd68561080d537ef3d3d6f81b25a6416ad600a 2025-12-04T10:28:33.7382232Z untagged: 308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/ci-image@sha256:ba21003510dba4bdeed83df81a56fa468e0ee1b612a9445ae1f402a280804f97 2025-12-04T10:28:33.7383684Z deleted: sha256:add7313791033822205cdb3cf32096534b2cfaa4855bd48119b59000bfe00301 2025-12-04T10:28:33.7384455Z deleted: sha256:85a76b7bf29ad34eb76cce6f46af5d49a58b6272f80f983d5c769e82c7749301 2025-12-04T10:28:33.7385183Z deleted: sha256:0882f3ce59ff5ae30195ee4b059fc713e13eda107a3a7814a4616ac9058a30a4 2025-12-04T10:28:33.7385916Z deleted: sha256:64ba5b9344c11a3e4729136076830b90ac4cf1554046edb1bd4f0784b66ebd9b 2025-12-04T10:28:33.7386400Z deleted: sha256:88213c59cf461a65ab9b6cb07b4195dc9d41b5241c152daa002c7b3112e09124 2025-12-04T10:28:33.7386855Z deleted: sha256:4c0f83afa802ffbc05ebaf1aa50e48a2447c7c295549a6dded80ac63437906ca 2025-12-04T10:28:33.7387423Z deleted: sha256:6f7ec74460e8fb070c8209949095ea3be5f4e2fd69c9f750cd39ac4093f5e64b 2025-12-04T10:28:33.7387861Z deleted: sha256:d6928b0d1021b31942fdcb64e5eb4a34682de66e959dd424ed6ed02c29cd706d 2025-12-04T10:28:33.7388299Z deleted: sha256:4e9fbcb1705a6351bb34dd320558752614308636b94fd9ae6f26063e3deadc0a 2025-12-04T10:28:33.7388824Z deleted: sha256:43aabd0201f48712f21758071352dea029b4de37be08b2e2197706856a9ecbf2 2025-12-04T10:28:33.7389248Z deleted: sha256:940a98dec78303f0548beb1033242a45e9097607ef3e55c8b949b69b73d1b95e 2025-12-04T10:28:33.7389679Z deleted: sha256:d2849fa0e0411cf66e4408831d70e38838afb55b11a80c1c4d8aa0ae7dc9ca40 2025-12-04T10:28:33.7390104Z deleted: sha256:14f40d23c20c7e562623f89deb376520296758bc39dd3c77284049b84ebd8a31 2025-12-04T10:28:33.7390539Z deleted: sha256:a8ccba61f90ca097cb391d0f4fbed0d9f821d06b00e28f7332e9e2dcfcbac4ca 2025-12-04T10:28:33.7390993Z deleted: sha256:91b2060d290547d3b517d4a11d994bbe23f4560b5546cb91918ca1828dde6be1 
2025-12-04T10:28:33.7391419Z deleted: sha256:b42a184755715dcfead7fad655a127433541d316d9628f5f730ff17ad5f8071c 2025-12-04T10:28:33.7391863Z deleted: sha256:aa5b4f3c9169061dc3c6da0e677e8a86f11ecb0a3f9fb4861ab3d8c04379775c 2025-12-04T10:28:33.7392315Z deleted: sha256:b4dcf450081a48d77fea0a21b8d810a69c03608a595e754fe7d365058d0579b7 2025-12-04T10:28:33.7392761Z deleted: sha256:4f7fe12d3d4f5bf890c7ada4ce16f17a105472aa6509a778f917dcce2f28174b 2025-12-04T10:28:33.7393199Z deleted: sha256:2d1d5a74182594f9a8553df00fdcfc809dba407bcd6700d667f862cbe9d555ce 2025-12-04T10:28:33.7393641Z deleted: sha256:d901e2f5d449aeed16b727bdcc11fc0e0f6c30c8fc5c39ac7eeac8a74d9d176c 2025-12-04T10:28:33.7394081Z deleted: sha256:a04df2603bd12372c6632469a9a81ebc4a8d677452c250672b9692884fa6a452 2025-12-04T10:28:33.7394566Z deleted: sha256:f438a6b52273a552dc3820a55c74c53a62a0eae9f2a7d21b37125add7d71639f 2025-12-04T10:28:33.7395001Z deleted: sha256:d4b09517e9518d709ac98b0ae6f8446ec9ac51688253607b1fca67aa2c87b3f4 2025-12-04T10:28:33.7395429Z deleted: sha256:c1fa38335237f5e7263e39d3d3de98215bcfbbb12b826955c02e149bf68efd13 2025-12-04T10:28:33.7395852Z deleted: sha256:c898d20a30de901fca74d7611663b17ab48e1726a11e031e40548ed16ee81877 2025-12-04T10:28:33.7396294Z deleted: sha256:3baceec7096518fcc10696feba551639d698b3145c2fc09cac927bb60c0fd751 2025-12-04T10:28:33.7396748Z deleted: sha256:5245aaaa3d5c3a19f76b9a6c920bd82d1a0ff5289f87c8c109652089709d9b3b 2025-12-04T10:28:33.7397183Z deleted: sha256:f05cc789b95246938c377f474c41187965b89ceac0250e7d5124bec32153f447 2025-12-04T10:28:33.7397620Z deleted: sha256:07ec4fc008de4e7a2c794ec7094cc72e0d287c04c8b2156163aee0bae147fe2d 2025-12-04T10:28:33.7398071Z deleted: sha256:c6302601ad5fde573c1f8c900250478fca7fdc6907d8fd4fae651b94b4d9264d 2025-12-04T10:28:33.7398513Z deleted: sha256:cc5e955ee1dc54931f02606c5ea87aae14f03b5d764492be611480ab041f2882 2025-12-04T10:28:33.7398954Z deleted: sha256:f21c03518996d98452338f4e80bcfd9b139a1dab155f4830be0d3f623035269f 
2025-12-04T10:28:33.7399399Z deleted: sha256:519ca6f1279f7886f25f0005527cfa627deebbc5b7d7cdbfa7ef962bcfc4c26d 2025-12-04T10:28:33.7399827Z deleted: sha256:0ef990495216807d0175b192045be3f617e72331bc373b3434807f41bf69168d 2025-12-04T10:28:33.7400256Z deleted: sha256:7093edf7319e1f0e01654c3224e32c8dede5b948d106e0b9b03cbf0bb1091e33 2025-12-04T10:28:33.7400679Z deleted: sha256:c478161e058e2f4041555c3e880b95ee1ee047938dc58549a3a88135740996ae 2025-12-04T10:28:33.7401185Z deleted: sha256:9bb853b0d938cd7c36a80ce8ee40653f2c0ff92719209b11beb03acc8855ce3e 2025-12-04T10:28:33.7401625Z deleted: sha256:fdf2ace71a78ce6910ef9c4b073c195531da47022443b606bb92dcd6499b6afc 2025-12-04T10:28:33.7402057Z deleted: sha256:576c2b3770d871937d3cfb7014328bcb4bd1aed0c28bc438764b3bfdac4c1ac2 2025-12-04T10:28:33.7402498Z deleted: sha256:878e92b9cb82de09ac14a9d5f3f7bc2411a799b6f54d0d64b78c2bb4d1fdc0fc 2025-12-04T10:28:33.7403038Z deleted: sha256:85c8c3b98b65a6695f988a10cc66c981d73a3ef03eda15b8e14d227b50b56300 2025-12-04T10:28:33.7403490Z deleted: sha256:ce2ab3ba07794f9ee95d6ea7de6dcd3d2aed96561f9a79192dd56ca5bf29313a 2025-12-04T10:28:33.7403927Z deleted: sha256:37a6e12976ca957286977e696e63012ab9821214b0483fe1a48d29dcb280508a 2025-12-04T10:28:33.7404363Z deleted: sha256:cd1d5d3dd7038144ca6fe961c0d4c8e705625ae0c36190ba8b3e9602abedad19 2025-12-04T10:28:33.7404820Z deleted: sha256:0e707276e0be2e0008b86d594fadc0d16444d66c4fb7227c56f144cbb3c2affd 2025-12-04T10:28:33.7405304Z deleted: sha256:22d4aad6a2ada91b341c1225a0f314042b8aeabef7568c5c019709b058bf070b 2025-12-04T10:28:33.7405751Z deleted: sha256:ee4adacf4e0933131d0275eddad406b3c8147e6cf07a292b99f1aff4b5355f33 2025-12-04T10:28:33.7406197Z deleted: sha256:43da0b9e7c0e18403dcb834e53628dc7c970ccb2dbd091878c0d7c0170dbc97f 2025-12-04T10:28:33.7406641Z deleted: sha256:00571684bdcd75beda15eb7d4e79b5458bc914350f9bb4d87fcdc97ad15e0da1 2025-12-04T10:28:33.7407077Z deleted: sha256:41615f09950259f1d75e82ef35b6fc53b18fe71ebff143744cfd51009d04349e 
2025-12-04T10:28:33.7407524Z deleted: sha256:75ab34d2eed3c7915467a506ab6dab2711918fbabe94add2fb5c62780221ab0c 2025-12-04T10:28:33.7407968Z deleted: sha256:0a39ef2bebf44c1c3893d1e5fb42dad48b8fac7ca673141267ee967f85455e89 2025-12-04T10:28:33.7408404Z deleted: sha256:9b7d024e48ba1f9824a54597621b1b062cbc4aa41a77d81ca538d6b5c24a612c 2025-12-04T10:28:33.7408836Z deleted: sha256:392257172de6434c271bd93394218a91e9aa86d7c18abc2f2759317b9d5fb6de 2025-12-04T10:28:33.7409253Z deleted: sha256:6c3232860b930866a463a356124fc392c7e5f04895695229257e8c3e8a02711d 2025-12-04T10:28:33.7409684Z deleted: sha256:63dd55b807215e2fa6c715419ac0c5072d02dddc848dbf74bb7e77b906b5eaed 2025-12-04T10:28:33.7410112Z deleted: sha256:07a8738c1b4584db72ed9aa60f5274321eb0ba16263450da3a75df8326ebc25f 2025-12-04T10:28:33.7410541Z deleted: sha256:053fe2965b01281d12040ec1893e0d1aa77362a49ea9a1067402272c69dad9f5 2025-12-04T10:28:33.7410982Z deleted: sha256:7857fb5eb181c4e80262ecab60bdd3c266cf3d1409ceb76c05882609b416a8d3 2025-12-04T10:28:33.7411457Z deleted: sha256:752528477fc99089de3bd2c6da7b30cf34f2e901fe06d8fcfe685b411461e883 2025-12-04T10:28:33.7411889Z deleted: sha256:cce0210e2f4b042601813df03aa294a86b0c668fcfc75f4c63f6fa12b2952e15 2025-12-04T10:28:33.7412331Z deleted: sha256:f2bb405a26705ecd12d21380d26d9355d01db3a2175080fbdb468f2b5a25a76c 2025-12-04T10:28:33.7412788Z deleted: sha256:ad430120d4ffbaf97cd8d6de6ea8eefa4a8f80ec45f0b176c6b26bff0970fd33 2025-12-04T10:28:33.7413229Z deleted: sha256:225a4910baea7cc540ed43eeac75046293800ab0b8e0192b51e991c8cb50bcf3 2025-12-04T10:28:33.7413672Z deleted: sha256:a259945b0c3507f049fbac10fb3d3ffe43d45e83c91b80ae8cd1dafb855ad83c 2025-12-04T10:28:33.7414103Z deleted: sha256:862a98881b1d5adad5c21d01602773b894794097de80964ef8f47bcaadb43255 2025-12-04T10:28:33.7414529Z deleted: sha256:1cf6d3c8b6c2694b79a2d08719594903811c330a36a4c7a8a7153a350b53d292 2025-12-04T10:28:33.7414954Z deleted: sha256:232a1ae8b0fee817ff7838bb5986a2f38377d3b1dbbf5217b576af0f953b0844 
2025-12-04T10:28:33.7415394Z deleted: sha256:c72c5705dabd6314423dd7d4fb260a20d5d9886b2ebce60d19e9d78c4a2335c2 2025-12-04T10:28:33.7415827Z deleted: sha256:296734cf81fd92c913884d058908598424ffe072676e38de289bbab83768c7bd 2025-12-04T10:28:33.7416241Z deleted: sha256:7c76040481b889847a1804021aeff07547eaa4ee706d6137db218d497a8fd9c1 2025-12-04T10:28:33.7416837Z deleted: sha256:d5e293f5b354e8cbcc6de893ea72cc632b02d8fdfbb08ec3127c4e9662f3ebff 2025-12-04T10:28:33.7417436Z deleted: sha256:f35a64e429c88e249645090f21fbe7dae108d98e0ab4ea13184f24b3fd66c315 2025-12-04T10:28:33.7417955Z deleted: sha256:ce6ae8d595c8e69115c51b1ce4f9a9158795d7b863b1cb53f21c39a87974d41b 2025-12-04T10:28:33.7418399Z deleted: sha256:8941abaee59400fb9b3a60765fea4a1fc2a6a447467a6d983e84c7f72494a450 2025-12-04T10:28:33.7418862Z deleted: sha256:ef53c29a9a2c2bc80ffdb9bfaf92842436b5755ec1ce828b9d11e5e27d656ea1 2025-12-04T10:28:33.7419311Z deleted: sha256:7a347fb0acb43f1c814f8c8ff21185e8b5cf64d7bc5988cea060f77d906e08b5 2025-12-04T10:28:33.7419834Z deleted: sha256:cc855dc9be79496e15175569dced2d13477e50b077a5fd3945f9bf50018880c1 2025-12-04T10:28:33.7420265Z deleted: sha256:f7a9946ada3d4786658bc0b643808bb32a9a45e4e90e30dc43ee19e2dbe24024 2025-12-04T10:28:33.7420699Z deleted: sha256:c22a9215f62812c1d2e32827f5221ff556c5b6702aadbdab6b87b8293f19635e 2025-12-04T10:28:33.7421131Z deleted: sha256:959a56746620012e37c1def1a83c5afb1e7c0adc59b021a28beb53c24df98032 2025-12-04T10:28:33.7421569Z deleted: sha256:31a0fff0695bf6100c17954be72eab2095b466d559c75c3faf2a17d8c41e6ebe 2025-12-04T10:28:33.7422044Z deleted: sha256:c15e2b5241b9e55af1b2593e544391b4b44d0505e6528e8f12425136e93b424c 2025-12-04T10:28:33.7422470Z deleted: sha256:73974f74b436f39a2fdb6461b1e3f7c3e41c73325776fa71d16b942a5b4a365b 2025-12-04T10:28:33.7422838Z untagged: public.ecr.aws/docker/library/python:3.13 2025-12-04T10:28:33.7423333Z untagged: public.ecr.aws/docker/library/python@sha256:3f986299a7b8b44b0d8cf9bda2b22361ce5c3058ef5d7cb17fb7452506680ab0 
2025-12-04T10:28:33.7423912Z deleted: sha256:44438aecfedf7b6086fce506dae0db5ba7fc0027f9b743f1a75a6b5cbc7de70a 2025-12-04T10:28:33.7424367Z deleted: sha256:6f09a1f5d8a107c2532fbd116e75116cb75fa77b1a7d72d3bdf1ac12de152acd 2025-12-04T10:28:33.7424805Z deleted: sha256:fe5f3ac0be086125eb1e3cd10cc33e8e426f4e079381f7ce5a987b626e99fa67 2025-12-04T10:28:33.7425252Z deleted: sha256:79dd2061a22cf919cfc4f1f02704bfda09afadb017265e670ee54441d296c06c 2025-12-04T10:28:33.7425700Z deleted: sha256:9447ad402aafdbee17e999b0ec84ad89c2646dbebf054d469d4f8bee77f66212 2025-12-04T10:28:33.7426151Z deleted: sha256:7a4909f3c1975be52292f53107495ee1b41c17494918767ccedf1cf1688ae318 2025-12-04T10:28:33.7426573Z deleted: sha256:3474923d97f1f498237650a7d51bd4aea37d5e6b9d8a778777920584af5dd560 2025-12-04T10:28:33.7427007Z deleted: sha256:683afd1773444401a9cbd24842ee5d9154a11abb4fab63ddea5c03df788597ee 2025-12-04T10:28:33.7427368Z 2025-12-04T10:28:33.7427460Z Total reclaimed space: 35.55GB 2025-12-04T10:28:33.7498149Z Post job cleanup. 2025-12-04T10:28:33.7527351Z Post job cleanup. 2025-12-04T10:28:33.8619050Z (node:101805) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead. 2025-12-04T10:28:33.8619808Z (Use `node --trace-deprecation ...` to show where the warning was created) 2025-12-04T10:28:33.8790599Z Post job cleanup. 2025-12-04T10:28:33.8845763Z Post job cleanup. 
2025-12-04T10:28:33.9775970Z [command]/usr/bin/git version 2025-12-04T10:28:33.9836121Z git version 2.50.1 2025-12-04T10:28:33.9869404Z Copying '/home/ec2-user/.gitconfig' to '/home/ec2-user/actions-runner/_work/_temp/c90b80d8-75a3-442f-b648-e1adcaeb6e4b/.gitconfig' 2025-12-04T10:28:33.9879105Z Temporarily overriding HOME='/home/ec2-user/actions-runner/_work/_temp/c90b80d8-75a3-442f-b648-e1adcaeb6e4b' before making global git config changes 2025-12-04T10:28:33.9879969Z Adding repository directory to the temporary git global config as a safe directory 2025-12-04T10:28:33.9883678Z [command]/usr/bin/git config --global --add safe.directory /home/ec2-user/actions-runner/_work/pytorch/pytorch 2025-12-04T10:28:33.9927672Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand 2025-12-04T10:28:33.9968496Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :" 2025-12-04T10:28:34.0325040Z Entering 'android/libs/fbjni' 2025-12-04T10:28:34.0402475Z Entering 'third_party/FP16' 2025-12-04T10:28:34.0473677Z Entering 'third_party/FXdiv' 2025-12-04T10:28:34.0541405Z Entering 'third_party/NNPACK' 2025-12-04T10:28:34.0612403Z Entering 'third_party/NVTX' 2025-12-04T10:28:34.0684009Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T10:28:34.0756462Z Entering 'third_party/XNNPACK' 2025-12-04T10:28:34.0842090Z Entering 'third_party/aiter' 2025-12-04T10:28:34.0911263Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T10:28:34.0989134Z Entering 'third_party/benchmark' 2025-12-04T10:28:34.1059785Z Entering 'third_party/composable_kernel' 2025-12-04T10:28:34.1148051Z Entering 'third_party/cpp-httplib' 2025-12-04T10:28:34.1219416Z Entering 'third_party/cpuinfo' 2025-12-04T10:28:34.1290103Z Entering 'third_party/cudnn_frontend' 2025-12-04T10:28:34.1361774Z Entering 'third_party/cutlass' 2025-12-04T10:28:34.1442443Z 
Entering 'third_party/fbgemm' 2025-12-04T10:28:34.1512807Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T10:28:34.1580813Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T10:28:34.1657381Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T10:28:34.1730001Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T10:28:34.1808282Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T10:28:34.1882467Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T10:28:34.1952000Z Entering 'third_party/fbgemm/external/json' 2025-12-04T10:28:34.2024684Z Entering 'third_party/flash-attention' 2025-12-04T10:28:34.2097351Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T10:28:34.2175790Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T10:28:34.2258168Z Entering 'third_party/flatbuffers' 2025-12-04T10:28:34.2331530Z Entering 'third_party/fmt' 2025-12-04T10:28:34.2400292Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T10:28:34.2470635Z Entering 'third_party/gloo' 2025-12-04T10:28:34.2540202Z Entering 'third_party/googletest' 2025-12-04T10:28:34.2610259Z Entering 'third_party/ideep' 2025-12-04T10:28:34.2678666Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T10:28:34.2758065Z Entering 'third_party/ittapi' 2025-12-04T10:28:34.2829723Z Entering 'third_party/kineto' 2025-12-04T10:28:34.2899298Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T10:28:34.2969820Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T10:28:34.3041311Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T10:28:34.3111389Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T10:28:34.3182485Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T10:28:34.3249392Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 
2025-12-04T10:28:34.3323339Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T10:28:34.3390599Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T10:28:34.3461101Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T10:28:34.3532065Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T10:28:34.3601984Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T10:28:34.3675330Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:34.3751783Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T10:28:34.3828129Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T10:28:34.3901810Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T10:28:34.3973636Z Entering 'third_party/kleidiai' 2025-12-04T10:28:34.4048132Z Entering 'third_party/mimalloc' 2025-12-04T10:28:34.4121802Z Entering 'third_party/nlohmann' 2025-12-04T10:28:34.4196246Z Entering 'third_party/onnx' 2025-12-04T10:28:34.4287699Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T10:28:34.4363356Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T10:28:34.4433814Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T10:28:34.4502225Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T10:28:34.4572557Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T10:28:34.4640739Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T10:28:34.4712027Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T10:28:34.4781695Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T10:28:34.4850478Z Entering 
'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T10:28:34.4919566Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:34.4990874Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T10:28:34.5063578Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T10:28:34.5150186Z Entering 'third_party/pocketfft' 2025-12-04T10:28:34.5219131Z Entering 'third_party/protobuf' 2025-12-04T10:28:34.5292257Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T10:28:34.5361342Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T10:28:34.5435345Z Entering 'third_party/psimd' 2025-12-04T10:28:34.5505779Z Entering 'third_party/pthreadpool' 2025-12-04T10:28:34.5583572Z Entering 'third_party/pybind11' 2025-12-04T10:28:34.5653156Z Entering 'third_party/python-peachpy' 2025-12-04T10:28:34.5722407Z Entering 'third_party/sleef' 2025-12-04T10:28:34.5791566Z Entering 'third_party/tensorpipe' 2025-12-04T10:28:34.5859962Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T10:28:34.5929077Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T10:28:34.5997913Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T10:28:34.6069340Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T10:28:34.6136887Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T10:28:34.6231988Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader 2025-12-04T10:28:34.6256917Z http.https://github.com/.extraheader 2025-12-04T10:28:34.6266215Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader 2025-12-04T10:28:34.6298646Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 
'http.https://github.com/.extraheader' || :" 2025-12-04T10:28:34.6646886Z Entering 'android/libs/fbjni' 2025-12-04T10:28:34.6694697Z http.https://github.com/.extraheader 2025-12-04T10:28:34.6741737Z Entering 'third_party/FP16' 2025-12-04T10:28:34.6790065Z http.https://github.com/.extraheader 2025-12-04T10:28:34.6832763Z Entering 'third_party/FXdiv' 2025-12-04T10:28:34.6879561Z http.https://github.com/.extraheader 2025-12-04T10:28:34.6922119Z Entering 'third_party/NNPACK' 2025-12-04T10:28:34.6968059Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7011747Z Entering 'third_party/NVTX' 2025-12-04T10:28:34.7057489Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7102059Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T10:28:34.7147323Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7193782Z Entering 'third_party/XNNPACK' 2025-12-04T10:28:34.7238448Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7296261Z Entering 'third_party/aiter' 2025-12-04T10:28:34.7342026Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7383385Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T10:28:34.7427922Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7482500Z Entering 'third_party/benchmark' 2025-12-04T10:28:34.7527564Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7571497Z Entering 'third_party/composable_kernel' 2025-12-04T10:28:34.7616905Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7669420Z Entering 'third_party/cpp-httplib' 2025-12-04T10:28:34.7715711Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7768032Z Entering 'third_party/cpuinfo' 2025-12-04T10:28:34.7814716Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7860670Z Entering 'third_party/cudnn_frontend' 2025-12-04T10:28:34.7906270Z http.https://github.com/.extraheader 2025-12-04T10:28:34.7950971Z Entering 'third_party/cutlass' 2025-12-04T10:28:34.7996758Z http.https://github.com/.extraheader 
2025-12-04T10:28:34.8048685Z Entering 'third_party/fbgemm' 2025-12-04T10:28:34.8100069Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8144466Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T10:28:34.8189417Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8232734Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T10:28:34.8278591Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8329957Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T10:28:34.8375262Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8419184Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T10:28:34.8465605Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8518079Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T10:28:34.8562061Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8605736Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T10:28:34.8651391Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8692076Z Entering 'third_party/fbgemm/external/json' 2025-12-04T10:28:34.8736357Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8785930Z Entering 'third_party/flash-attention' 2025-12-04T10:28:34.8834550Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8878388Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T10:28:34.8924519Z http.https://github.com/.extraheader 2025-12-04T10:28:34.8979548Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T10:28:34.9025378Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9079057Z Entering 'third_party/flatbuffers' 2025-12-04T10:28:34.9125402Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9171437Z Entering 'third_party/fmt' 2025-12-04T10:28:34.9216755Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9262261Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T10:28:34.9307647Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9352433Z Entering 
'third_party/gloo' 2025-12-04T10:28:34.9397509Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9441134Z Entering 'third_party/googletest' 2025-12-04T10:28:34.9487128Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9531307Z Entering 'third_party/ideep' 2025-12-04T10:28:34.9575611Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9615968Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T10:28:34.9662969Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9714749Z Entering 'third_party/ittapi' 2025-12-04T10:28:34.9760462Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9802728Z Entering 'third_party/kineto' 2025-12-04T10:28:34.9848459Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9891084Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T10:28:34.9936898Z http.https://github.com/.extraheader 2025-12-04T10:28:34.9978312Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T10:28:35.0024805Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0071810Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T10:28:35.0118299Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0165612Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T10:28:35.0209823Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0253825Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T10:28:35.0298703Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0340604Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T10:28:35.0386785Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0434505Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 2025-12-04T10:28:35.0479553Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0523507Z Entering 
'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T10:28:35.0568920Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0613407Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T10:28:35.0658363Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0704152Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T10:28:35.0747913Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0792494Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T10:28:35.0837998Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0881706Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:35.0927674Z http.https://github.com/.extraheader 2025-12-04T10:28:35.0974768Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T10:28:35.1019026Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1068553Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T10:28:35.1114689Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1162319Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T10:28:35.1207623Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1253836Z Entering 'third_party/kleidiai' 2025-12-04T10:28:35.1298395Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1342721Z Entering 'third_party/mimalloc' 2025-12-04T10:28:35.1388081Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1431834Z Entering 'third_party/nlohmann' 2025-12-04T10:28:35.1477584Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1523156Z Entering 'third_party/onnx' 2025-12-04T10:28:35.1568793Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1624720Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T10:28:35.1669241Z 
http.https://github.com/.extraheader 2025-12-04T10:28:35.1716423Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T10:28:35.1763098Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1806657Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T10:28:35.1851479Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1893092Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T10:28:35.1936455Z http.https://github.com/.extraheader 2025-12-04T10:28:35.1978109Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T10:28:35.2024334Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2071892Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T10:28:35.2117263Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2162501Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T10:28:35.2207144Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2250687Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T10:28:35.2297541Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2340358Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T10:28:35.2388179Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2429588Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:35.2476185Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2520885Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T10:28:35.2566222Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2613490Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T10:28:35.2658798Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2720901Z Entering 'third_party/pocketfft' 2025-12-04T10:28:35.2766330Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2810450Z Entering 
'third_party/protobuf' 2025-12-04T10:28:35.2855050Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2900021Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T10:28:35.2946025Z http.https://github.com/.extraheader 2025-12-04T10:28:35.2990549Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T10:28:35.3036007Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3082981Z Entering 'third_party/psimd' 2025-12-04T10:28:35.3127742Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3172213Z Entering 'third_party/pthreadpool' 2025-12-04T10:28:35.3218225Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3261779Z Entering 'third_party/pybind11' 2025-12-04T10:28:35.3307045Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3351676Z Entering 'third_party/python-peachpy' 2025-12-04T10:28:35.3397542Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3440540Z Entering 'third_party/sleef' 2025-12-04T10:28:35.3487121Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3530550Z Entering 'third_party/tensorpipe' 2025-12-04T10:28:35.3576490Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3619144Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T10:28:35.3665929Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3709945Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T10:28:35.3756885Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3799941Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T10:28:35.3844126Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3887997Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T10:28:35.3934580Z http.https://github.com/.extraheader 2025-12-04T10:28:35.3977200Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T10:28:35.4024114Z http.https://github.com/.extraheader 2025-12-04T10:28:35.4095117Z [command]/usr/bin/git config --local --name-only 
--get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.4127856Z [command]/usr/bin/git submodule foreach --recursive git config --local --show-origin --name-only --get-regexp remote.origin.url 2025-12-04T10:28:35.4483441Z Entering 'android/libs/fbjni' 2025-12-04T10:28:35.4513193Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config remote.origin.url 2025-12-04T10:28:35.4536776Z Entering 'third_party/FP16' 2025-12-04T10:28:35.4569323Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config remote.origin.url 2025-12-04T10:28:35.4590898Z Entering 'third_party/FXdiv' 2025-12-04T10:28:35.4621150Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config remote.origin.url 2025-12-04T10:28:35.4642073Z Entering 'third_party/NNPACK' 2025-12-04T10:28:35.4672436Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config remote.origin.url 2025-12-04T10:28:35.4693225Z Entering 'third_party/NVTX' 2025-12-04T10:28:35.4723558Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config remote.origin.url 2025-12-04T10:28:35.4745082Z Entering 'third_party/VulkanMemoryAllocator' 2025-12-04T10:28:35.4774383Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config remote.origin.url 2025-12-04T10:28:35.4797133Z Entering 'third_party/XNNPACK' 2025-12-04T10:28:35.4826562Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config remote.origin.url 2025-12-04T10:28:35.4861981Z Entering 'third_party/aiter' 2025-12-04T10:28:35.4894161Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config remote.origin.url 2025-12-04T10:28:35.4914447Z Entering 'third_party/aiter/3rdparty/composable_kernel' 2025-12-04T10:28:35.4943632Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config remote.origin.url 2025-12-04T10:28:35.4974234Z Entering 'third_party/benchmark' 2025-12-04T10:28:35.5005204Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config remote.origin.url 2025-12-04T10:28:35.5025558Z Entering 'third_party/composable_kernel' 2025-12-04T10:28:35.5057491Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config remote.origin.url 2025-12-04T10:28:35.5087549Z Entering 'third_party/cpp-httplib' 2025-12-04T10:28:35.5124589Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config remote.origin.url 2025-12-04T10:28:35.5147038Z Entering 'third_party/cpuinfo' 2025-12-04T10:28:35.5176632Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config remote.origin.url 2025-12-04T10:28:35.5199429Z Entering 'third_party/cudnn_frontend' 2025-12-04T10:28:35.5229994Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config remote.origin.url 2025-12-04T10:28:35.5251362Z Entering 'third_party/cutlass' 2025-12-04T10:28:35.5280702Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config remote.origin.url 2025-12-04T10:28:35.5311177Z Entering 'third_party/fbgemm' 2025-12-04T10:28:35.5341443Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config remote.origin.url 2025-12-04T10:28:35.5364824Z Entering 'third_party/fbgemm/external/asmjit' 2025-12-04T10:28:35.5394458Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config remote.origin.url 2025-12-04T10:28:35.5416210Z Entering 'third_party/fbgemm/external/composable_kernel' 2025-12-04T10:28:35.5444770Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config remote.origin.url 2025-12-04T10:28:35.5472740Z Entering 'third_party/fbgemm/external/cpuinfo' 2025-12-04T10:28:35.5502306Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config remote.origin.url 2025-12-04T10:28:35.5523086Z Entering 'third_party/fbgemm/external/cutlass' 2025-12-04T10:28:35.5551201Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config remote.origin.url 2025-12-04T10:28:35.5580424Z Entering 'third_party/fbgemm/external/googletest' 2025-12-04T10:28:35.5610480Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config remote.origin.url 2025-12-04T10:28:35.5631187Z Entering 'third_party/fbgemm/external/hipify_torch' 2025-12-04T10:28:35.5660875Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config remote.origin.url 2025-12-04T10:28:35.5681477Z Entering 'third_party/fbgemm/external/json' 2025-12-04T10:28:35.5712255Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config remote.origin.url 2025-12-04T10:28:35.5736361Z Entering 'third_party/flash-attention' 2025-12-04T10:28:35.5768448Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config remote.origin.url 2025-12-04T10:28:35.5789837Z Entering 'third_party/flash-attention/csrc/composable_kernel' 2025-12-04T10:28:35.5820930Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config remote.origin.url 2025-12-04T10:28:35.5847162Z Entering 'third_party/flash-attention/csrc/cutlass' 2025-12-04T10:28:35.5877069Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config remote.origin.url 2025-12-04T10:28:35.5908144Z Entering 'third_party/flatbuffers' 2025-12-04T10:28:35.5937621Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config remote.origin.url 2025-12-04T10:28:35.5961074Z Entering 'third_party/fmt' 2025-12-04T10:28:35.5992119Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config remote.origin.url 2025-12-04T10:28:35.6013456Z Entering 'third_party/gemmlowp/gemmlowp' 2025-12-04T10:28:35.6042739Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config remote.origin.url 2025-12-04T10:28:35.6063917Z Entering 'third_party/gloo' 2025-12-04T10:28:35.6093933Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config remote.origin.url 2025-12-04T10:28:35.6116782Z Entering 'third_party/googletest' 2025-12-04T10:28:35.6145715Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config remote.origin.url 2025-12-04T10:28:35.6168803Z Entering 'third_party/ideep' 2025-12-04T10:28:35.6199761Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config remote.origin.url 2025-12-04T10:28:35.6219412Z Entering 'third_party/ideep/mkl-dnn' 2025-12-04T10:28:35.6249975Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config remote.origin.url 2025-12-04T10:28:35.6280226Z Entering 'third_party/ittapi' 2025-12-04T10:28:35.6310930Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config remote.origin.url 2025-12-04T10:28:35.6332048Z Entering 'third_party/kineto' 2025-12-04T10:28:35.6361185Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config 
remote.origin.url 2025-12-04T10:28:35.6380505Z Entering 'third_party/kineto/libkineto/third_party/dynolog' 2025-12-04T10:28:35.6411906Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config remote.origin.url 2025-12-04T10:28:35.6431826Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/DCGM' 2025-12-04T10:28:35.6461883Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config remote.origin.url 2025-12-04T10:28:35.6485352Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/cpr' 2025-12-04T10:28:35.6514626Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config remote.origin.url 2025-12-04T10:28:35.6535348Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/fmt' 2025-12-04T10:28:35.6564675Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config remote.origin.url 2025-12-04T10:28:35.6585088Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags' 2025-12-04T10:28:35.6613552Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config remote.origin.url 2025-12-04T10:28:35.6632394Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/gflags/doc' 2025-12-04T10:28:35.6663494Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config remote.origin.url 2025-12-04T10:28:35.6686694Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/glog' 
2025-12-04T10:28:35.6716428Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config remote.origin.url 2025-12-04T10:28:35.6739345Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/googletest' 2025-12-04T10:28:35.6768898Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config remote.origin.url 2025-12-04T10:28:35.6790804Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/json' 2025-12-04T10:28:35.6821213Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config remote.origin.url 2025-12-04T10:28:35.6844051Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/pfs' 2025-12-04T10:28:35.6874246Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config remote.origin.url 2025-12-04T10:28:35.6898599Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp' 2025-12-04T10:28:35.6929810Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T10:28:35.6949673Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:35.6982231Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T10:28:35.7006562Z Entering 'third_party/kineto/libkineto/third_party/dynolog/third_party/prometheus-cpp/3rdparty/googletest' 
2025-12-04T10:28:35.7038080Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T10:28:35.7065962Z Entering 'third_party/kineto/libkineto/third_party/fmt' 2025-12-04T10:28:35.7096947Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config remote.origin.url 2025-12-04T10:28:35.7119013Z Entering 'third_party/kineto/libkineto/third_party/googletest' 2025-12-04T10:28:35.7148268Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config remote.origin.url 2025-12-04T10:28:35.7173306Z Entering 'third_party/kleidiai' 2025-12-04T10:28:35.7203728Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config remote.origin.url 2025-12-04T10:28:35.7227064Z Entering 'third_party/mimalloc' 2025-12-04T10:28:35.7258967Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config remote.origin.url 2025-12-04T10:28:35.7280998Z Entering 'third_party/nlohmann' 2025-12-04T10:28:35.7309909Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config remote.origin.url 2025-12-04T10:28:35.7332718Z Entering 'third_party/onnx' 2025-12-04T10:28:35.7361959Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config remote.origin.url 2025-12-04T10:28:35.7396297Z Entering 'third_party/onnx/third_party/pybind11' 2025-12-04T10:28:35.7427721Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config remote.origin.url 2025-12-04T10:28:35.7453239Z Entering 'third_party/opentelemetry-cpp' 2025-12-04T10:28:35.7482823Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config remote.origin.url 2025-12-04T10:28:35.7505775Z Entering 'third_party/opentelemetry-cpp/third_party/benchmark' 2025-12-04T10:28:35.7537330Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config remote.origin.url 2025-12-04T10:28:35.7561706Z Entering 'third_party/opentelemetry-cpp/third_party/googletest' 2025-12-04T10:28:35.7590402Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config remote.origin.url 2025-12-04T10:28:35.7611158Z Entering 'third_party/opentelemetry-cpp/third_party/ms-gsl' 2025-12-04T10:28:35.7640069Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config remote.origin.url 2025-12-04T10:28:35.7661200Z Entering 'third_party/opentelemetry-cpp/third_party/nlohmann-json' 2025-12-04T10:28:35.7690936Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config remote.origin.url 2025-12-04T10:28:35.7712850Z Entering 'third_party/opentelemetry-cpp/third_party/opentelemetry-proto' 2025-12-04T10:28:35.7741165Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config remote.origin.url 2025-12-04T10:28:35.7762993Z Entering 'third_party/opentelemetry-cpp/third_party/opentracing-cpp' 2025-12-04T10:28:35.7792242Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config remote.origin.url 2025-12-04T10:28:35.7812847Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp' 2025-12-04T10:28:35.7841694Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config remote.origin.url 2025-12-04T10:28:35.7861170Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/civetweb' 2025-12-04T10:28:35.7891375Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config remote.origin.url 2025-12-04T10:28:35.7914302Z Entering 'third_party/opentelemetry-cpp/third_party/prometheus-cpp/3rdparty/googletest' 2025-12-04T10:28:35.7943807Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config remote.origin.url 2025-12-04T10:28:35.7967546Z Entering 'third_party/opentelemetry-cpp/tools/vcpkg' 2025-12-04T10:28:35.7997982Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config remote.origin.url 2025-12-04T10:28:35.8038508Z Entering 'third_party/pocketfft' 2025-12-04T10:28:35.8070585Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config remote.origin.url 2025-12-04T10:28:35.8091750Z Entering 'third_party/protobuf' 2025-12-04T10:28:35.8122594Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config remote.origin.url 2025-12-04T10:28:35.8145238Z Entering 'third_party/protobuf/third_party/benchmark' 2025-12-04T10:28:35.8177375Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config remote.origin.url 2025-12-04T10:28:35.8198685Z Entering 'third_party/protobuf/third_party/googletest' 2025-12-04T10:28:35.8229481Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config remote.origin.url 
2025-12-04T10:28:35.8253904Z Entering 'third_party/psimd' 2025-12-04T10:28:35.8285672Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config remote.origin.url 2025-12-04T10:28:35.8308165Z Entering 'third_party/pthreadpool' 2025-12-04T10:28:35.8339299Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config remote.origin.url 2025-12-04T10:28:35.8362547Z Entering 'third_party/pybind11' 2025-12-04T10:28:35.8392647Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config remote.origin.url 2025-12-04T10:28:35.8414837Z Entering 'third_party/python-peachpy' 2025-12-04T10:28:35.8447008Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config remote.origin.url 2025-12-04T10:28:35.8469695Z Entering 'third_party/sleef' 2025-12-04T10:28:35.8500303Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config remote.origin.url 2025-12-04T10:28:35.8522243Z Entering 'third_party/tensorpipe' 2025-12-04T10:28:35.8553264Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config remote.origin.url 2025-12-04T10:28:35.8573518Z Entering 'third_party/tensorpipe/third_party/googletest' 2025-12-04T10:28:35.8603070Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config remote.origin.url 2025-12-04T10:28:35.8625107Z Entering 'third_party/tensorpipe/third_party/libnop' 2025-12-04T10:28:35.8655138Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config remote.origin.url 2025-12-04T10:28:35.8676809Z Entering 'third_party/tensorpipe/third_party/libuv' 2025-12-04T10:28:35.8707104Z 
file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config remote.origin.url 2025-12-04T10:28:35.8728803Z Entering 'third_party/tensorpipe/third_party/pybind11' 2025-12-04T10:28:35.8759620Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config remote.origin.url 2025-12-04T10:28:35.8778525Z Entering 'third_party/tensorpipe/third_party/pybind11/tools/clang' 2025-12-04T10:28:35.8809351Z file:/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config remote.origin.url 2025-12-04T10:28:35.8856840Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/android/libs/fbjni/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.8886782Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FP16/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.8916642Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/FXdiv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.8947511Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.8977141Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NVTX/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9005489Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/VulkanMemoryAllocator/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9035049Z [command]/usr/bin/git config 
--file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/XNNPACK/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9064310Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9092728Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/aiter/modules/3rdparty/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9121207Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9148689Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9177906Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpp-httplib/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9205681Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9233596Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cudnn_frontend/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9260546Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9287498Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/config --name-only --get-regexp 
^includeIf\.gitdir: 2025-12-04T10:28:35.9315178Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/asmjit/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9342923Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9371174Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cpuinfo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9396942Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9426300Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9457359Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/hipify_torch/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9488113Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fbgemm/modules/external/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9515100Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9542390Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/composable_kernel/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9568337Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flash-attention/modules/csrc/cutlass/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9594848Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/flatbuffers/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9622769Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9650361Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gemmlowp/gemmlowp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9678889Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/gloo/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9705458Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9735418Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9765121Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ideep/modules/mkl-dnn/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9793425Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/ittapi/config 
--name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9832898Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9848180Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9876168Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/DCGM/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9905696Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/cpr/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9934694Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9965881Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:35.9994781Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/gflags/modules/doc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0022705Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/glog/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0057173Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0083358Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0112238Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/pfs/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0139453Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0166859Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0194374Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/dynolog/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0223151Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/fmt/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0249365Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kineto/modules/libkineto/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0276881Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/kleidiai/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0305001Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/mimalloc/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0332472Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/nlohmann/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0358653Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0385736Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/onnx/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0412605Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0437985Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0465114Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0492433Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/ms-gsl/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0519478Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/nlohmann-json/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0548081Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentelemetry-proto/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0574243Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/opentracing-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0602703Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0631683Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/civetweb/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0658291Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/third_party/prometheus-cpp/modules/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0686961Z [command]/usr/bin/git config 
--file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/opentelemetry-cpp/modules/tools/vcpkg/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0714595Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pocketfft/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0742401Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0769649Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/benchmark/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0796888Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/protobuf/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0824493Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/psimd/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0851030Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/NNPACK_deps/pthreadpool/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0880472Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0907089Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/python-peachpy/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0934806Z [command]/usr/bin/git config --file 
/home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/sleef/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0963935Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.0990191Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/googletest/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.1016816Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libnop/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.1044932Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/libuv/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.1073828Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.1101859Z [command]/usr/bin/git config --file /home/ec2-user/actions-runner/_work/pytorch/pytorch/.git/modules/third_party/tensorpipe/modules/third_party/pybind11/modules/tools/clang/config --name-only --get-regexp ^includeIf\.gitdir: 2025-12-04T10:28:36.1219388Z A job completed hook has been configured by the self-hosted runner administrator 2025-12-04T10:28:36.1238988Z ##[group]Run '/home/ec2-user/runner-scripts/after_job.sh' 2025-12-04T10:28:36.1245644Z shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0} 2025-12-04T10:28:36.1245922Z ##[endgroup] 2025-12-04T10:28:36.1360493Z [!ALERT!] Swap in detected! [!ALERT!] 2025-12-04T10:28:53.2584838Z Cleaning up orphan processes